FOREWORD


The Rocky Mountain symposia on microcomputers are organized as a forum
for presentation and discussion of basic research in the field of microcomputers.
The aim is to encourage a broad base of interaction between the industrial,
governmental, and academic communities, thus providing guidance and feedback
for the directions of basic research and for the effective implementation of
research achievements. Promotion of such cooperation and stimulation of
interest in these objectives should have major impact on the future of
microcomputers and their uses.

From  the  standpoint  of  these  objectives,  the  First  Annual  Rocky  Mountain 
Symposium  on  Microcomputers  has  proved  successful.  Participation  from 
government  and  academia  is  manifest  in  the  sponsorship  of  this  symposium 
and  in  the  research  papers  published  in  the  following  pages.  Participation  by 
the  industrial  community,  toward  whom  much  of  the  material  is  directed,  is 
especially evident in the audience and the special panel discussion. The
continuation of dialogue between these communities must play a critical role in
the  future  of  microcomputer  development. 


Michael  Andrews 
Steve  McCormick 
by

W.  J.  Cody 

Abstract 

Microcomputers  are  now  capable  of  serious  numerical  computation  using 
programmed  floating-point  arithmetic  and  Basic  compilers.  Unless  numerical 
software  designers  for  these  machines  exploit  experience  gained  in  providing 
software  for  larger  machines,  history  will  repeat  with  the  initial  spread 
of  treacherous  software.  This  paper  discusses  good  software,  especially 
for  the  elementary  functions,  in  terms  of  reliability  and  robustness.  The 
emphasis  is  on  insight  rather  than  detailed  algorithms,  to  show  why  certain 
things  are  important  and  how  they  may  be  achieved. 

"What  history  and  experience  teach  us  is  this  — that  people  ... 
never have learned anything from history, or acted on principles deduced
from  it"  - Hegel 

"Those  who  cannot  remember  the  past  are  condemned  to  repeat  it"  - 
Santayana 

1.  Introduction 

We  are  now  witnessing  in  the  microcomputer  industry  an  accelerated 
repetition  of  the  development  of  computers  in  general.  Within  a short  time 
microcomputers have achieved memory size, wordlength, instruction sets and
peripherals with capabilities rivaling those found on minicomputers just a
few years ago. Microcomputer software now includes operating systems,
algebraic language interpreters and compilers, graphics packages and
text-editing programs. Although most of this software is primitive, within a
few  years  there  may  be  almost  no  practical  distinction  between  the  hardware 
and  software  capabilities  of  microcomputers  and  minicomputers,  except 
perhaps  speed. 

One discouraging aspect of this development is that much of the emerging
software ignores the lessons learned in the development of software for
larger machines. (New software for the larger machines often ignores these
lessons as well, reaffirming the observations of Hegel and Santayana.)

Once disseminated, inferior software is difficult to eradicate. This is
especially  true  for  numerical  software  — witness  the  lingering  death  of 
IBM's  Scientific  Subroutine  Package  [11].  Fortunately,  little  numerical 
software is available for microcomputers and there is still time to produce
a decent  product  for  the  first  generation.  The  challenge  is  to  do  the  job. 

This paper discusses problems and techniques in preparing good numerical
software based upon experiences with larger computers. The intent in
this presentation is to provide insight and motivation, to show why certain
things are important and indicate how they may be achieved. Detailed
algorithms and descriptions of good numerical software for a variety of tasks
are to be found elsewhere, and are not included here. Section 2 begins by
considering the background of such work, including limitations imposed on
the software by current and expected microcomputer hardware/software
environments. Section 3 introduces and discusses certain desirable attributes of
numerical software. Section 4 discusses the design of software for the
elementary functions, and Section 5 briefly discusses other items of
numerical software that will probably be among the first to appear.

2.  Background 


We  assume  that  floating-point  arithmetic  of  some  form  is  available. 
Numerical  software  can  and  does  exist  without  it,  e.g.,  software  for 
numerically  controlled  machine  tools,  but  most  scientific  computation  and 
data reduction is best done with it. To the author's knowledge no hardware
floating-point  microcomputer  CPUs  exist  yet.  Floating-point  operations  are 
still  either  programmed  or  provided  on  a peripheral  arithmetic  device. 

The  most  important  consideration  in  floating-point  arithmetic  is  that 
it  be  "clean"  and  free  of  anomalous  behavior.  Clean  arithmetic  properly 
rounds  the  results  of  the  four  basic  arithmetic  operations,  generating 
necessary  guard  digits  in  intermediate  stages  of  the  operations  to  protect 
the  rounding.  It  is  also  free  of  mathematical  surprises  — numerical 
behavior  that  deviates  without  warning  from  the  expected  norm.  Anomalous 
behavior  is  often  associated  with  the  fringes  of  the  arithmetic  system  and 
considered to be ignorable by all but the most finicky numerical analysts,
but  this  is  not  necessarily  the  case.  Some  of  the  most  insidious  examples 
can  affect  computations  far  removed  from  the  fringes  of  the  system.  For 
example,  on  a certain  line  of  minicomputers  it  is  often  true  that  A + A is 
correct,  but  2.0*A  is  incorrect  by  up  to  15  times  the  normal  rounding  error. 
This  discrepancy,  as  with  many  anomalous  behaviors,  is  the  result  of  a 
combination  of  unfortunate  design  decisions.  In  this  case  the  arithmetic 
is  hexadecimal  coupled  with  rounding  by  truncation  and  a lack  of  guard 
digits.  When  such  combinations  appear  on  short  wordlength  machines,  the 
results  of  even  simple  numerical  computations  of  any  length  must  be  suspect 
[16].

The set of computers with clean floating-point arithmetic, hardwired or
programmed,  is  probably  empty  even  though  it  is  easy  to  design  such  an 
arithmetic.  When  floating-point  must  be  programmed  it  is  clearly  harder  and 
more  expensive  to  write  a clean  package  than  a slightly  dirty  one.  The 
clean  package  also  probably  occupies  more  space  and  executes  more  slowly. 

All  of  this  translates  into  a visible  expense  that  the  package  author  and 
user  are  each  anxious  to  eliminate  under  the  assumption  that  the  shortcuts 
taken introduce only minor perturbations from the expected norm. The same
considerations  apply  to  hardware  design.  Engineers  attempt  to  cut  expenses 
by  eliminating  guard  digits  and  proper  rounding,  or  by  substituting  clever 
circuitry  to  perform  almost  correct  arithmetic  in  more  efficient  ways.  A 
classic  example  is  the  large  scale  machine  which,  among  other  things,  treats 
floating-point  numbers  with  the  smallest  exponent  and  a legal  significand 
as  nonzero  for  addition  and  subtraction  operations,  but  which  for  engineering 
reasons treats them as zero for multiplication and division. Thus there
exist  nonzero  X on  this  machine  for  which  1.0*X  is  zero.  Admittedly  these 
numbers  lie  on  the  fringe  of  the  arithmetic  system  in  this  case. 

Anomalies can also arise when honest efforts are made to improve the
characteristics of a design. For example, some computers carry more
precision in the active arithmetic registers than they carry in stored numbers,
essentially limiting roundoff error in any computation to that incurred in
storing a number away and recovering it. But then there are two possibly
different values of a number, the one in the registers and the one in storage,
which can lead to a situation where X-X is nonzero. This has obvious
implications for testing convergence of iterative processes, among other things.
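
The register/storage discrepancy can be imitated on a present-day machine. The sketch below is a simulation under stated assumptions, not a description of any computer mentioned here: an ordinary Python float (64 bits) plays the role of the wide arithmetic register, and a hypothetical helper `store_single`, which rounds a value to 32 bits through the standard `struct` module, plays the role of storage.

```python
import struct

def store_single(value):
    """Round a Python float to single precision, imitating a store to memory."""
    return struct.unpack("f", struct.pack("f", value))[0]

def register_minus_stored(value):
    """Return X - X where one copy stayed in the 'register' and one was stored."""
    in_register = value                 # full-precision copy
    in_storage = store_single(value)    # copy that made the trip through memory
    return in_register - in_storage

x = 1.0 / 3.0
print(register_minus_stored(x))      # nonzero: the two copies of X differ
print(register_minus_stored(0.5))    # 0.0: 0.5 is exact in both formats
```

A convergence test that waits for the difference of two successive iterates to become exactly zero may therefore never succeed if one iterate is held in a register and the other has been stored.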

Even  with  clean  arithmetic  the  job  of  writing  good  numerical  software  is 
not  easy.  All  computations  must  still  be  carried  out  over  that  discrete 
bounded subset of the real number system dictated by the floating-point
representation scheme. Usually the set is closed under the arithmetic
operations by including underflow and overflow. But the arithmetic
operators may not always return the closest element to the "true" result,
introducing roundoff error. There are other peculiarities traceable to the
arithmetic system as well. The common hexadecimal floating-point
representation introduces "wobbling precision," for instance, in which there may be as
many as three fewer significant bits, almost one less decimal place of
significance, in the representation of some numbers than others. This
phenomenon may force a reorganization of computation to avoid poor
significance in the representation of intermediate results. For example, one
algorithm  for  the  tangent  function  uses  the  fractional  part  of  X*(4/pi).  But 
the  hexadecimal  constant  4/pi  has  less  significance  than  the  hexadecimal 
constant  pi/4,  and  almost  one  extra  decimal  place  of  accuracy  can  sometimes 
be  obtained  by  using  the  fractional  part  of  X/(pi/4)  instead  [6]. 
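
The difference between the two constants can be made concrete by counting the leading zero bits in the first hexadecimal digit of each normalized significand. The following is a minimal sketch; the function name `hex_wasted_bits` is hypothetical, and the count is derived from the binary exponent returned by the standard `math.frexp` rather than from any real hexadecimal hardware.

```python
import math

def hex_wasted_bits(value):
    """Leading zero bits (0 to 3) in the first hexadecimal digit of the
    normalized base-16 significand of a positive value."""
    _, binary_exponent = math.frexp(value)       # value = m * 2**e, 0.5 <= m < 1
    hex_exponent = -((-binary_exponent) // 4)    # ceiling of e / 4
    return 4 * hex_exponent - binary_exponent

print(hex_wasted_bits(4.0 / math.pi))   # 3: leading hex digit is 1 (binary 0001)
print(hex_wasted_bits(math.pi / 4.0))   # 0: leading hex digit is C (binary 1100)
```

With a fixed number of hexadecimal digits, 4/pi thus carries three fewer significant bits than pi/4, which is the loss the reorganized computation recovers.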

Supportive  software  is  another  factor  to  consider.  The  cost  of  unclean 
arithmetic  can  be  compounded  by  poor  algebraic  language  interpreters  and 
compilers.  Consider  the  standard  Fortran  compilers  on  the  line  of  machines 
which  confuses  zero  and  nonzero  numbers.  Because  these  compilers  use 
floating-point  addition  and  subtraction  in  making  logical  comparisons,  a 
Fortran  statement  such  as 

IF (X .NE. 0.0E0) Y = 1.0E0 / X

can  trigger  an  interrupt  for  division  by  zero  despite  the  obvious  effort  to 
avoid  that  possibility.  (The  "fix"  is  to  replace  X by  1.0E0*X  in  the 
logical  expression.) 

As with hardware designs, honest efforts to improve compilers can lead
to unexpected problems. Optimizing compilers, for instance, often greatly
improve inefficient code written by inexperienced programmers; on the other
hand, if unrestrained, they sometimes undermine careful code written by
experienced programmers. It is a common practice on machines without guard
digits to rewrite the expression X-1.0 as (X-0.5)-0.5 to preserve accuracy
for X slightly less than 1.0. There is at least one compiler which
helpfully combines constants whenever possible, thus replacing the second
expression with the first. This optimization cannot be suppressed; it must
be avoided by programming subterfuge.

The list of hardware and software related pitfalls is a long one which
we will not pursue further. The interested reader may consult more detailed
discussions such as [3,10,13]. The point we have tried to make is that
there are peculiarities in existing hardware and supportive software that
affect numerical computations in subtle ways.

Good numerical software can be written using unclean arithmetic and
imperfect compilers; that is not an issue. Any anomaly can be avoided by
appropriate programming provided the anomaly is known and provided its
avoidance is important. The problem is that known anomalies constantly
inspire evasive programming action, and unsuspected anomalies, or anomalies
whose importance is not anticipated, frustrate even experienced programmers.
The money saved in the design of less-than-clean arithmetics and compilers
is expended many times over in attempting to write self-immunizing software
or in naive reliance upon numerical results obtained from non-immunized
software. This tremendous cost is hidden. Clean arithmetic and good
compilers simply make the job of providing good numerical software easier and
cheaper.

For the record, we describe a simple floating-point arithmetic that we
feel would facilitate the preparation of good numerical software. We start
with a sign-magnitude representation so that every number can be negated,
and a binary or decimal, but definitely not hexadecimal, radix. The primary
working precision is fourteen to eighteen decimal places of significance with
an exponent range at least ten times the significance, i.e., if the
arithmetic carries S significant digits it can represent numbers with exponents
up to 10*S. The exponent range is balanced so that almost every number can
be reciprocated. Underflow and overflow interrupts are precise, with
replacement of underflow by zero and continued execution possible. Rounding is
"clean"  with  the  equivalent  of  two  guard  digits,  and  rounding  by  truncation 
is  optional.  Finally  the  results  of  arithmetic  operations  carry  the  same 
precision  as  stored  quantities.  A double  working  precision  with  a compatible 
representation  would  also  be  useful,  although  there  is  an  inconsistency  in 
our  design  philosophy  when  we  double  the  precision  without  modifying  the 
exponent  range. 

Such  a floating-point  arithmetic  is  not  available  anywhere.  It  is 
possible  to  achieve  much  of  this  on  a microcomputer,  although  the  suggested 
precision  and  exponent  range  may  be  unrealistic  for  this  type  of  machine. 

Less  precision  and  range  is  acceptable,  however,  if  the  rest  of  the  design 
is good. The worst combination would be short significand and short exponent
range  coupled  with  hexadecimal  representation  and  a truncating  arithmetic 
without  guard  characters  [16]. 

The arithmetic we are most likely to see on microcomputers for the next
few years is exemplified by the software floating-point package for the
INTEL 8008 and 8080 in use at Lawrence Livermore Laboratory [17], and the
modified calculator chip recently announced by National Semiconductor
Corporation [18]. The software package is binary, while the chip is decimal.
Each provides about eight decimals of precision, but the software only
provides decimal exponents up to roughly 20, while the chip carries decimal
exponents up to 99. Each provides the four basic operations plus a square
root, with the chip from four to ten times slower than the software. The
author does not know the details of the arithmetic in either case, but
suspects the worst. The importance of the chip is that it also provides the
usual array of elementary functions, constants and conversions found on hand
calculators. We will discuss the implications of that in more detail later.

3.  Attributes  of  Good  Numerical  Software 


We  define  an  item  of  numerical  software  as  a running  documented  computer 
program  available  in  a particular  computer  environment.  This  distinguishes 
software from an algorithm printed in a journal or a numerical method
described in a textbook. The item of software probably implements such an
algorithm  or  numerical  method,  but  it  is  a separate  entity  not  to  be  confused 
with  the  others.  It  has  characteristics  all  its  own  completely  independent 
of  the  underlying  algorithm  [20].  The  same  basic  algorithm  may  be  imbedded 
in  several  different  programs  differing  widely  in  details  of  organization 
and  numerical  behavior,  therefore  differing  in  those  characteristics  affecting 
performance.  The  important  new  ingredient  in  the  successful  implementation 
of  an  algorithm  is  a detailed  knowledge  of  the  arithmetic  system  and  the 
supportive  programming  systems  to  be  exploited.  We  saw  in  the  last  section 
that  there  are  times  when  2.0*X  is  best  calculated  as  X+X.  Attention  to 
subtleties such as this may make the difference between a very useful
implementation and something less.

Our  discussion  will  concentrate  on  only  those  software  attributes 
related  to  performance.  Ideally  we  would  like  to  have  numerical  software 
that  is  accurate  and  efficient,  and  resilient  under  misuse.  We  would  like 
to  have  numerical  programs  that  can  be  trusted  to  accept  our  data,  operate 
upon it, and either return valid numerical results or an explanation of why
they  cannot  be  obtained.  Such  programs  are  described  as  reliable  and  robust. 

Reliability  refers  to  the  ability  of  the  program  to  perform  a well- 
defined  computation  both  accurately  and  efficiently.  Of  course,  reliability 
starts  with  the  choice  of  the  proper  algorithm,  but  it  is  primarily  an 
attribute  of  the  software.  Reliable  software  successfully  handles  the 
problem  set  defined  by  the  underlying  numerical  analysis,  realizing  accuracy 
over  that  problem  set  close  to  the  theoretical  prediction.  Reliability  is 
relative, and improper appreciation of the computer environment may degrade
reliability  by  restriction  of  either  the  problem  set  or  the  obtainable  accuracy. 

Robustness  refers  to  the  ability  of  a program  to  avoid  or  gracefully 
recover  from  computational  difficulties  without  unnecessary  interruption  of 
program  execution.  Consider  the  problem  of  underflow  for  example.  In  most 
cases  when  underflow  occurs  an  error  message  is  generated  and  then  execution 
continues  with  a zero  result.  If  the  underflow  significantly  alters  the  final 
computed  result  it  is  destructive;  otherwise  it  is  non-destructive.  The 
usual  underflow  error  message  is  annoying  to  a user  because  it  does  not  tell 
him  what  he  wants  to  know,  namely  whether  the  underflow  was  destructive  or 
non-destructive.  Robust  software  is  completely  free  of  underflow,  so  the 
question of its importance never arises. To achieve this, the computation
is  restructured  wherever  possible,  consistent  with  the  requirements  for 
reliability,  to  avoid  expressions  that  might  cause  underflow.  Should  the 
possibility  of  underflow  be  unavoidable,  tests  are  made  for  it  ahead  of  time 
and  appropriate  remedial  action  is  taken  if  it  is  detected.  Non-destructive 
underflow  is  quietly  bypassed  and  execution  continues  with  a zero  result; 
destructive  underflow  is  bypassed  with  an  error  return,  including  precise 
diagnostic  information.  Other  types  of  error  conditions  that  might  arise 
are treated in an analogous manner.
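
The distinction between destructive and non-destructive underflow, and the kind of restructuring that avoids the destructive kind, can be illustrated with a small computation chosen here purely as an example: the quantity sqrt(a*a + b*b). The function names are hypothetical. Note that Python raises no interrupt on underflow; it quietly substitutes zero, so the destructive case passes without any message at all.

```python
import math

def naive_norm(a, b):
    """sqrt(a*a + b*b) computed directly; squares of tiny inputs underflow."""
    return math.sqrt(a * a + b * b)

def scaled_norm(a, b):
    """Same quantity, restructured so that no destructive underflow occurs."""
    scale = max(abs(a), abs(b))
    if scale == 0.0:                  # tested ahead of time: result is exactly zero
        return 0.0
    ra, rb = a / scale, b / scale     # the larger ratio has magnitude exactly 1
    # The smaller ratio may still underflow when squared, but it is then
    # negligible next to 1.0, so that underflow is non-destructive.
    return scale * math.sqrt(ra * ra + rb * rb)

print(naive_norm(1e-200, 0.0))     # 0.0: the underflow destroyed the answer
print(scaled_norm(1e-200, 0.0))    # 1e-200
```

The sketch does not guard against overflow in the final multiplication; a robust routine would treat that condition in an analogous manner.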


4.  The  Elementary  Functions 

Among the first important items of numerical software to appear in any
computer system are subprograms for elementary and algebraic functions.
These functions are considered to be so important that they are often
predefined in algebraic programming languages such as Basic and Fortran.
Special libraries of appropriate subprograms then accompany compilers for
these languages.

Fundamental as these functions are, we will see that the programs
implementing them sometimes contain gross blunders that go undetected for years.
It is still rare for an elementary function library to contain programs of
uniformly high quality even though the techniques for writing such programs
have been known and practiced by some individuals for over fifteen years.
Until now most of this knowledge has appeared in small bits and pieces widely
scattered in the literature, and the average system programmer assigned the
task of writing the library programs has had to rely on his own often meager
knowledge of calculus and numerical analysis. This background is
insufficient for the preparation of reliable and robust function programs. Much
of the following discussion is based upon a forthcoming software manual for
the elementary functions [7] designed to assist just such systems
programmers in the preparation of better programs than they could probably
write by themselves.

A typical algorithm for evaluating an elementary function consists of
three distinct steps. The first accepts an arbitrary argument within the
function domain and reduces it to a related argument in a primary domain,
plus some additional parameters. The second evaluates the function, or a
related function, for the reduced argument. The third then combines the
computed function value with the additional parameters to reconstruct the
desired function value of the original argument.

As an example, an algorithm for the evaluation of sine(x), where x is
expressed in radians, might be based on the following analysis. Let
|x| = N*pi + f where |f| <= pi/2. Then

sine(x) = sign(x) * sine(f) * (-1)**N .

The first step in the algorithm is to determine N and f given x. The second
step is to calculate sine(f), probably using a truncated Taylor series, a
Padé approximation or perhaps even a minimax rational approximation. The
final step is to reconstruct sine(x) from N and sine(f).
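
The three steps can be sketched as follows. This is an illustration only, under several assumptions: the names `sine` and `sine_kernel` are hypothetical, the second step uses a truncated Taylor series (one of the options named above) with ten terms, and the first step uses the obvious, not the careful, form of the reduction.

```python
import math

# Taylor coefficients of sine(f)/f in powers of f*f, highest power first:
# (-1)**k / (2k+1)! for k = 9, 8, ..., 0.
_COEFFS = [(-1.0) ** k / math.factorial(2 * k + 1) for k in range(9, -1, -1)]

def sine_kernel(f):
    """Step 2: truncated Taylor series for sine(f), intended for |f| <= pi/2."""
    g = f * f
    p = _COEFFS[0]
    for c in _COEFFS[1:]:            # nested multiplication in g
        p = p * g + c
    return f * p

def sine(x):
    sign = -1.0 if x < 0.0 else 1.0
    y = abs(x)
    n = round(y / math.pi)           # step 1: |x| = n*pi + f
    f = y - n * math.pi              #         (naive reduction)
    s = sine_kernel(f)               # step 2: sine of the reduced argument
    return sign * s * (-1.0) ** n    # step 3: reconstruct sine(x)

print(sine(1.0), math.sin(1.0))
print(sine(-2.5), math.sin(-2.5))
```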

We  consider  the  first  two  steps  in  more  detail  to  see  how  they  may  be 
designed  for  reliability  and  robustness.  Our  basic  assumption  is  that  the 
given argument x is exact. This is not often the case in practical
situations, but a program as important as an elementary function routine should be
designed to satisfy the most demanding user, and there are users with exact
integer arguments. This emphasis on accuracy cannot be justified if the
extra  cost  is  excessive,  but  in  most  cases  the  more  accurate  routine  costs 
only  a few  percent  more  to  implement  and  use  than  the  less  accurate  one. 

Under  the  assumption  of  exact  arguments  the  argument  reduction  step  is 
critical  in  preserving  accuracy.  For  the  sine  routine  the  expression 

f = |x| - N*pi

must be evaluated carefully lest there be a loss of significance in f
associated with taking the difference of two nearly equal quantities. The
obvious  step  is  to  extend  x to  double  working  precision,  a representation 
that  is  again  exact,  and  calculate  f in  this  higher  precision.  This  is 
clearly too expensive to consider in most cases, but much the same effect
can be achieved by careful reformulation of the calculation in the working
precision. Let C1 and C2 be two constants whose sum represents pi to
beyond the working precision, and let C1 be exactly representable in less
than working precision. C1 = 201/64 works for most non-decimal machines.
Then calculate

f = (x - N*C1) - N*C2 .

Now the loss of leading significant digits in the first term is compensated
by a gain in trailing significant digits in the second term, provided the
product N*C1 can be exactly represented in the registers, i.e., provided
N, hence x, is not too large. The added expense is one stored constant and
two operations.
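
The gain can be measured directly. The sketch below compares the obvious and the reformulated reductions for the exact argument x = 22 (N = 7) in Python's 64-bit arithmetic, using exact rational arithmetic from the standard `fractions` module as the reference. The function names, the decimal value written for C2, and the 50-decimal value of pi used as the reference are supplied here for illustration.

```python
from fractions import Fraction
import math

C1 = 201.0 / 64.0                # 3.140625, exact in 8 significant bits
C2 = 9.6765358979323846e-4       # pi - C1, rounded to working precision

# pi to 50 decimals, used only as an exact-arithmetic reference.
PI_REF = Fraction("3.14159265358979323846264338327950288419716939937510")

def reduce_naive(x, n):
    return x - n * math.pi

def reduce_careful(x, n):
    return (x - n * C1) - n * C2

def reduction_error(f, x, n):
    """Absolute error of a computed f against the exact x - n*pi."""
    return float(abs(Fraction(f) - (Fraction(x) - n * PI_REF)))

x, n = 22.0, 7
print(reduction_error(reduce_naive(x, n), x, n))
print(reduction_error(reduce_careful(x, n), x, n))
```

For this argument the reformulated reduction is markedly more accurate, because N*C1 and the first subtraction are both exact and the only roundoff arises in the small correction term.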

The restriction on the size of x or N appears to be an added limitation
on the domain of the function routine, but this is not really so. As x
becomes large in magnitude the machine representation of x approaches an
integer multiple of the machine representation of pi and there is little
significance in f, hence little significance in the computed sine. A robust
routine should warn the user of this situation rather than provide a random
number for a function value. The limitation on the size of N merely
clarifies the situation and establishes a reasonable boundary for the largest
acceptable argument. Near this boundary point the function has already
become "grainy", i.e., arguments which differ by only one unit in the least
significant bit position return function values which agree to only a few
leading significant bits.
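
A routine that warns rather than answers might be organized as in the following sketch. Everything specific in it is an assumption made for illustration: the names `spacing` and `checked_sine`, the use of a Python exception as the error return, and the bound `MAX_N`, chosen so that N*(201/64) remains exactly representable in a 53-bit significand. The library sine stands in for the reduced computation.

```python
import math

T = 53                   # significand bits in a Python float
MAX_N = 2 ** (T - 8)     # assumed bound: 201 occupies 8 bits, so N*C1 stays exact

def spacing(x):
    """Gap between adjacent floating-point numbers with the exponent of x."""
    _, e = math.frexp(x)             # x = m * 2**e with 0.5 <= |m| < 1
    return math.ldexp(1.0, e - T)

def checked_sine(x):
    """Sine that reports, rather than answers, when x is too large to reduce."""
    if not math.isfinite(x):
        raise ValueError("sine: argument is not finite")
    n = round(abs(x) / math.pi)
    if n > MAX_N:
        raise ValueError("sine: argument too large, no significance left")
    return math.sin(x)               # stand-in for the reduced computation

print(spacing(2.0 ** 52))   # 1.0: neighbouring arguments are a whole radian apart
```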

Argument reduction steps for other functions differ in detail but follow
the same general pattern. In almost every case careful handling of critical
computations involving a mathematical constant will pay big dividends in
accuracy with minimal expense. This assumes that the argument is exact,
because there is no way to compensate for an unknown error in the argument.
Table I shows the difference careful argument reduction can make for selected
function computations on an H.P. 65 hand calculator. This machine is an
older programmable 10-digit decimal machine. In the cases cited up to half
of the precision has been lost in the usual argument reduction which is
hardware on this machine, but all figures are correct when carefully
programmed argument reduction is used (later H.P. calculators also return full
precision results). The similar loss of half of the precision in a nominal
6-digit microcomputer can be serious.


Table I

Effect of careful argument reduction in selected cases
on an H.P. 65 hand calculator


Function                     Sin(22)             Tan(11)         Ln(1.00001)

"True" Value                 -8.8513 09290 E-3   -225.95084 65   9.9999 50000 E-6
"Hardware" Value             -8.8513 06326 E-3   -225.95092 46   9.9999 00000 E-6
Careful Argument Reduction   -8.8513 09290 E-3   -225.95084 64   9.9999 50000 E-6
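
The mechanism behind the Sin(22) entry can be imitated in 10-digit decimal arithmetic with Python's standard `decimal` module. The sketch makes no claim to reproduce the internal arithmetic of the calculator, and the split of pi into C1 = 3.14 plus a 10-digit remainder C2 is an assumption chosen here for a decimal machine; it only shows how many digits of the reduced argument 22 - 7*pi survive each form of the reduction.

```python
from decimal import Context, Decimal

TEN_DIGITS = Context(prec=10)         # every operation rounds to 10 digits
PI_10 = Decimal("3.141592654")        # pi rounded to 10 significant digits
C1 = Decimal("3.14")                  # short leading part of pi (assumed split)
C2 = Decimal("0.001592653590")        # pi - 3.14 to 10 significant digits

def reduce_naive(x, n):
    return TEN_DIGITS.subtract(x, TEN_DIGITS.multiply(n, PI_10))

def reduce_careful(x, n):
    head = TEN_DIGITS.subtract(x, TEN_DIGITS.multiply(n, C1))
    return TEN_DIGITS.subtract(head, TEN_DIGITS.multiply(n, C2))

x, n = Decimal(22), Decimal(7)
print(reduce_naive(x, n))      # 0.00885142     (6 significant digits survive)
print(reduce_careful(x, n))    # 0.00885142487  (22 - 7*pi = 0.0088514248714...)
```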

The argument reduction step is often the best place for two related
programs to be merged. The sine routine can serve double duty, for example,
by being used for the computation of the cosine as well. Because

cosine(x) = sine(x + pi/2)

we could calculate the cosine by adding pi/2 to the argument and then
calculating the sine. But we must be careful! Careless addition of pi/2 will negate
an accurate argument reduction. We can retain full significance in f if
instead of adjusting x we adjust the value of N for the computation of f.
Details are to be found in [7] if they are not obvious.
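
One way to realize the adjustment of N is sketched below. It is an interpretation supplied here, not a transcription of [7]: N is chosen from the shifted argument |x| + pi/2, and the reduced argument is then formed from |x| itself with N - 1/2 in place of N, so that pi/2 is never added to x. The library sine stands in for the second step, and the constants are those of the two-constant reduction with C1 = 201/64.

```python
import math

C1 = 201.0 / 64.0                # 3.140625, exact in 8 significant bits
C2 = 9.6765358979323846e-4       # pi - C1, rounded to working precision

def cosine(x):
    y = abs(x)                                  # cosine is an even function
    n = round((y + math.pi / 2.0) / math.pi)    # N chosen from the shifted argument
    xn = n - 0.5                                # adjust N rather than x
    f = (y - xn * C1) - xn * C2                 # f = |x| - (N - 1/2)*pi
    return math.sin(f) * (-1.0) ** n            # math.sin stands in for step 2

print(cosine(11.0), math.cos(11.0))
```

The rough addition of pi/2 is used only to select the integer N, where its rounding error is harmless; the significance of f rests entirely on the careful subtraction.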

The second step allows the programmer a great deal of freedom within
limitations imposed by accuracy. In our sine routine, as in most routines,
the most useful approximations come from the family of rational functions, a
typical member being

R_mn(f) = P_m(f) / Q_n(f)

where P_m and Q_n are polynomials of degree m and n, respectively. The special
case n=0 corresponds to pure polynomial forms such as truncated Taylor
series. In true rational forms, those with both m and n nonzero, one
coefficient may be chosen arbitrarily to satisfy additional numerical constraints,
all other coefficients being scaled accordingly. Such scaling is often used
to avoid loss of significance associated with wobbling precision on
hexadecimal machines, or to save one multiplication in the evaluation of

Q_n(f) = (...((q_n*f + q_(n-1))*f + q_(n-2))*f + ... + q_1)*f + q_0

by setting q_n = 1.0. True rationals may also be rewritten as truncated
continued fractions to avoid roundoff in the usual representation, or broken
into a constant plus a rational correction term to better represent a slowly
varying function. These and other possible maneuvers make true rational
forms generally superior to polynomials for achieving reliability.
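
The saving of one multiplication can be seen by writing out the nested evaluation in both forms. The function names below are hypothetical, and the monic form assumes a polynomial of degree at least one.

```python
def horner(coeffs, f):
    """Evaluate a polynomial by nested multiplication; coeffs run from the
    highest-degree coefficient down to the constant term (n multiplications)."""
    result = coeffs[0]
    for q in coeffs[1:]:
        result = result * f + q
    return result

def horner_monic(tail, f):
    """Same, when the leading coefficient is exactly 1.0; `tail` holds the
    remaining coefficients (n - 1 multiplications)."""
    result = f + tail[0]          # (1.0*f + q_(n-1)) without the multiplication
    for q in tail[1:]:
        result = result * f + q
    return result

# f**2 + 3*f + 2 at f = 2
print(horner([1.0, 3.0, 2.0], 2.0))     # 12.0
print(horner_monic([3.0, 2.0], 2.0))    # 12.0
```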

Approximations are most efficient when they preserve basic analytic
properties of the function being approximated. In our example, sine(f) is
an odd function, i.e., sine(-f) = -sine(f). This property is preserved by
rationals of the form f*R(f**2), and most sine programs use such an
approximation. But programs which use the same approximation for all values of f may
not be robust, because the intermediate quantity f**2 can underflow for |f|
sufficiently small. If f**2 is replaced by zero when underflow occurs, the
underflow is probably non-destructive, but the resulting arithmetic interrupt
and error message are still annoying. A robust sine program avoids the
underflow entirely by trapping out |f| less than some threshold, say 2**(-t/2)
where there are t bits in the significand of a floating-point number, and
returning f for the function value. Again the added expense for a clean
program is minimal.
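
The trap for small arguments amounts to one comparison ahead of the approximation, as in this sketch. The name `sine_primary` is hypothetical, t = 53 is the significand length of a Python float, and a ten-term Taylor polynomial stands in for whatever R a production routine would use.

```python
import math

T = 53                            # significand bits of a Python float
THRESHOLD = 2.0 ** (-T / 2.0)     # the 2**(-t/2) of the text

# R(g) approximates sine(f)/f with g = f*f: Taylor terms through g**9,
# highest power first.
_R = [(-1.0) ** k / math.factorial(2 * k + 1) for k in range(9, -1, -1)]

def sine_primary(f):
    """sine(f) for |f| <= pi/2, never forming f*f when it could underflow."""
    if abs(f) < THRESHOLD:
        return f                  # sine(f) equals f to working precision
    g = f * f
    r = _R[0]
    for c in _R[1:]:
        r = r * g + c
    return f * r

print(sine_primary(1e-200))       # 1e-200, with no underflow anywhere
print(sine_primary(0.5), math.sin(0.5))
```

Below the threshold the first neglected term, f**3/6, is smaller than half a unit in the last place of f, so returning f costs nothing in accuracy.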

The  chip  mentioned  in  the  previous  section  undoubtedly  portends  the  next 
major  modification  to  machine  hardware  for  scientific  computation.  This 
particular chip is intended for use with microcomputers, but there are clear
indications that similar hardware or microprogrammed implementations of the
elementary functions are also under consideration for maxicomputers [19].
Comparing the complexity of these functions with the complexity of
floating-point arithmetic, and considering the lack of clean arithmetic hardware,
there  is  genuine  cause  for  concern  about  the  quality  of  functions  implemented 
in  this  way. 

Calculator chips are our primary examples of this type of technology.
Most of these use a CORDIC scheme [21] in the second step of the general
computation outlined above. This involves a continued product expansion of
the function instead of the usual rational approximation. The CORDIC method
is fast and accurate in most cases, the major exception being the accuracy of
logarithms for arguments close to 1.0. Chips in recent Hewlett-Packard
calculators are even accurate in this case, but the details of the scheme
used, presumably a modified CORDIC, are secret. Outside of the primary range
the accuracy of the function still depends upon the argument reduction, a
fact most chip designers ignore. Only a few calculators, again including recent
Hewlett-Packard models, are as careful in argument reduction as we have

Verifying the reliability and robustness of a function program is as
delicate a task as writing the program in the first place. Consider the
simple problem of accuracy testing. The objective is to measure the error
in computed function values given exact arguments. The problem is to devise
a testing procedure which can do this without introducing additional error,
systematic or random. The best testing procedures involve direct comparisons
against higher precision computations, but such techniques are complicated
and often difficult to implement, especially when the higher precision
arithmetic must first be programmed. A second class of testing procedures
measures the error in selected mathematical identities. These procedures
are less discerning than the first ones, but are easily implemented.
Carefully done, test programs based on identities distinguish between good
and bad function programs and provide error statistics only slightly inferior
to those provided by tests of the first kind.

As an example, one procedure for testing the sine function measures the
relative error

E = [sine(x) - sine(x/3)*(3 - 4*sine(x/3)**2)] / sine(x)

in the triple angle formula. Arguments are drawn randomly from intervals
selected to minimize the subtraction error in E and to focus the accuracy
test on one particular aspect of the sine routine. Arguments from the
interval (0,pi/2) are used to test the routine when no argument reduction is
needed, i.e., to test the basic computation of sine(f) for the reduced
argument f; and arguments from the interval (6pi,13pi/2) are used to test the
accuracy of the argument reduction scheme. In both cases, however, unless x
and x/3 are exact machine numbers, test results are contaminated to the
point where they are useless. There are simple ways for assuring that these
exactness conditions hold, but we do not want to get into detailed numerical
analysis here. We urge the interested reader to consult [7] for more
information on these and other useful tests for the sine function, including
a complete Fortran test program.
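
A test of this kind might be sketched as below. This is not the Fortran
program of [7]; it is a Python illustration under stated assumptions. The
helper purify is hypothetical and is only one possible way to make both x
and x/3 exact machine numbers: it truncates y = x/3 to two bits fewer than
the significand holds, so that the product 3*y is formed without rounding.
The host library's math.sin plays the role of the routine under test.

```python
import math
import random
import sys


def purify(y, spare_bits=2):
    """Truncate y > 0 so that 3*y is exactly representable (assumed method)."""
    m, e = math.frexp(y)                      # y = m * 2**e, 0.5 <= m < 1
    keep = sys.float_info.mant_dig - spare_bits
    m = math.floor(m * 2.0 ** keep) / 2.0 ** keep
    return math.ldexp(m, e)


def triple_angle_test(lo, hi, count=2000, seed=1):
    """Return (MRE, RMS) of E over random arguments x drawn from (lo, hi)."""
    rng = random.Random(seed)
    errors = []
    while len(errors) < count:
        y = purify(rng.uniform(lo, hi) / 3.0)
        x = 3.0 * y                           # exact, so x/3 == y exactly
        if x == 0.0:
            continue                          # E is undefined at x = 0
        s = math.sin(y)
        e = (math.sin(x) - s * (3.0 - 4.0 * s * s)) / math.sin(x)
        errors.append(e)
    mre = max(abs(e) for e in errors)
    rms = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mre, rms


basic = triple_angle_test(0.0, math.pi / 2)                   # no reduction
reduced = triple_angle_test(6 * math.pi, 13 * math.pi / 2)    # reduction
```

The two calls correspond to the two intervals named in the text; comparing
the pairs of statistics separates the quality of the basic approximation
from the quality of the argument reduction.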

Table II gives results from selected runs of that test program on several
different computer systems. An examination of the error statistics indicates
which systems have sine routines with good basic approximations, good argument
reduction schemes, and good alternate entries for the cosine function. Tests
were also run using Basic 1.1 on the Datapoint 2200, where a curious blunder
was detected in the sine routine. Computations were reasonably good for
|x| < 2pi, but the routine returned the value of sine(4x) for x > 2pi. This
blunder was not detected by the identity tests described above, but was
detected by a different test designed for that specific purpose.

Table II

Results of random argument accuracy tests of sine and
cosine routines based on triple angle formulas. All
tests used 2000 uniformly distributed arguments in each
interval. MRE is MAX(ABS(E)), the maximum absolute
value of the relative error E, and RMS is the root-
mean-square value of the relative error E.

Machine, Library, and                      Test and Interval
Fl. Pt. Significance          sine(x)        sine(x)        cosine(x)
                              (0,pi/2)       (6pi,13pi/2)   (7pi,15pi/2)

IBM 370/195             MRE   16**(-4.84)    16**(-4.84)    16**(-4.82)
Extended Fortran        RMS   16**(-5.30)    16**(-5.31)    16**(-5.31)
6 Hex (24 bits)

PDP 11/45               MRE   2**(-22.06)    2**(-22.26)    2**(-11.37)
Fortran, DOS 8.02       RMS   2**(-23.90)    2**(-23.91)    2**(-16.34)
(24 bits)

Varian 72               MRE   2**(-20.13)    2**(-8.46)     2**(-9.31)
Fortran E3              RMS   2**(-22.20)    2**(-13.45)    2**(-14.69)
(22 bits)

As we indicated earlier, it is not often that the data used is precise
enough to justify our approach to accuracy. The real benefits are
intangible. Kuki states that the major benefit is psychological [15]. The
average user places great faith in the basic library programs, just as he
trusts the arithmetic operations. He usually looks elsewhere for an
explanation when computational error is unacceptably large. Programs written
as carefully as we suggest can be demonstrated to be accurate, building the
user's confidence. Less carefully written programs can be demonstrated to be
inaccurate on exact data, destroying the user's confidence, even though they
may be perfectly acceptable for processing his data. The penalty with the
less accurate routine is that the user may be misled in his search for errors
in his computation. In extreme cases he may even provide his own, often
inferior, replacements.

5.  Other  Common  Numerical  Software 

It is much easier to write good elementary function programs than it is
to write good numerical software for other purposes. Elementary functions
are simple mappings of one one-dimensional subset of the real numbers into
another. Usually there are only a few paths through an elementary function
routine, and it is the details of implementation, not the choice of algorithm,
that decide how reliable and robust the routine will be. By contrast, most
other mathematical processes of interest, even those representable in simple
mathematical terms, are complicated mappings involving subsets of n-space.
There are many paths through a typical subroutine for such a process. In
this case it is primarily the choice of algorithm that determines reliability
and robustness of software and not so much the details of implementation,
although the latter cannot be ignored completely.

We do not expect microcomputers to be called upon very soon to do
matrix eigenanalysis or solve stiff differential equations, but they probably
will be expected to do such simple tasks as summing data, performing linear
regression, calculating the Euclidean norm of a vector, and solving
quadratic equations. In each case the obvious algorithm can give reasonable
looking yet incorrect results without warning. The following example is
adapted from Kahan [14].

Linear regression is the process of fitting a straight line to data in
the least squares sense. A typical regression program in a microcomputer
might be expected to automatically receive data from an online experiment
until the end of the experiment is signalled, and then return the regression
coefficients and perhaps the mean and standard error for the data. The
classical equations are

$$Y = Mx + B$$

$$M = \frac{\sum xy - \sum x \sum y / n}{\sum x^2 - (\sum x)^2 / n}$$

$$B = \left(\sum y - M \sum x\right) / n$$

$$\bar{x} = \sum x / n$$

and

$$s_x^2 = \left[\sum x^2 - (\sum x)^2 / n\right] / (n-1)$$

where n is the number of data points (x,y), $\bar{x}$ is the mean, $s_x$ is
the standard deviation, and Y is the value predicted by the linear regression
[8]. A number of chips now used in hand calculators facilitate the use of
these equations by providing one instruction which automatically augments
each of the sums when given a new (x,y) data pair. There will naturally be a
tendency to use these classical equations should such chips become available
on microcomputers. Yet these equations can lead to surprising problems. For
example, consider an implementation in single precision arithmetic on the IBM
370/195. This machine uses hexadecimal floating-point arithmetic with a 24
bit significand, i.e., it nominally represents numbers to at least 7
significant decimal places. However, given the two points (1365.0, 1.0) and
(1366.0, 2.0), the program returns the grossly incorrect values


                 Calculated        True

   M                0.5            1.0
   B             -681.25       -1364.0
   $\bar{x}$     1365.5         1365.5
   $s_x$            1.0         $1/\sqrt{2}$

without any indication of a malfunction. The problem in this particular case
arises because $\sum x^2 / n$ cannot be represented exactly in the machine,
even though $\sum x^2$ can.
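
The failure can be reproduced with a small simulation. The sketch below is
an illustration only: chop_hex is a hypothetical helper that truncates every
intermediate result to six hexadecimal digits, which is an assumption about
how the single precision arithmetic behaves and not a description of the
actual hardware, and classical_regression is simply the classical equations
written out one operation at a time.

```python
import math


def chop_hex(v, digits=6):
    """Truncate v to `digits` hexadecimal significand digits (simulation)."""
    if v == 0.0:
        return 0.0
    _, e = math.frexp(abs(v))          # 2**(e-1) <= |v| < 2**e
    h = -(-e // 4)                     # smallest h with |v| < 16**h
    scale = 16.0 ** (digits - h)
    return math.copysign(math.floor(abs(v) * scale) / scale, v)


def classical_regression(points, fl=chop_hex):
    """Classical formulas; every intermediate result passes through fl."""
    n = float(len(points))
    sx = sy = sxy = sxx = 0.0
    for x, y in points:
        sx = fl(sx + x)
        sy = fl(sy + y)
        sxy = fl(sxy + fl(x * y))
        sxx = fl(sxx + fl(x * x))
    num = fl(sxy - fl(fl(sx * sy) / n))
    den = fl(sxx - fl(fl(sx * sx) / n))
    m = fl(num / den)
    b = fl(fl(sy - fl(m * sx)) / n)
    mean = fl(sx / n)
    s = math.sqrt(fl(den / (n - 1.0)))
    return m, b, mean, s


result = classical_regression([(1365.0, 1.0), (1366.0, 2.0)])
```

Under these assumptions the quotient formed in the denominator loses its
fractional half when it is truncated, and the simulated program returns the
"Calculated" column of the table rather than the true values.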




A slight modification of the computation will give correct results for
this data, i.e., we can simplify the expression for M by multiplying the
numerator and denominator by n, but that is not the important point. We are
treating the symptom and not the disease when we do that. The new
formulation will inevitably cause trouble with slightly different data.
Adding the point (1367.0, 3.0) results in an error exit for division by zero,
for instance, because now the two terms in the denominator agree to machine
precision. The real difficulty is that the classical equations are inherently
unreliable for numerical computation on a short wordlength machine. We must
seek a different algorithm.

The equations for M and $s_x^2$ can be rewritten as

$$M = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$

$$s_x^2 = \sum (x - \bar{x})^2 / (n-1)$$

which scales things so that none of the intermediate results becomes too large
in comparison to the wordlength. Of course, this does not help much if the
computation of $\bar{x}$ and $\bar{y}$ now requires that all data be kept in
storage until the value of n is known. Fortunately that is not the case;
running estimates of $\bar{x}_k = \sum_{i=1}^{k} x_i / k$ and of
$a_k = \sum_{i=1}^{k} (x_i - \bar{x}_k)^2$ can be calculated by recurrence
methods as follows [8,14]. Let $\bar{x}_1 = x_1$ and $a_1 = 0$. Then

$$\bar{x}_{k+1} = \bar{x}_k + (x_{k+1} - \bar{x}_k)/(k+1)$$

and

$$a_{k+1} = a_k + k(k+1)\left[(x_{k+1} - \bar{x}_k)/(k+1)\right]^2 .$$

This approach allows the computation of linear regression coefficients
essentially to within roundoff in the coefficients regardless of the number of
terms. Further, the regression coefficients, means and standard deviations
can be obtained at any intermediate point without disturbing subsequent
computations.
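
A running computation of this kind might look like the sketch below. The
class name RunningRegression and its method names are hypothetical. The
updates for the mean of x and for the sum of squared deviations of x follow
the two recurrences given above; the update for the cross-product sum c is
the analogous recurrence, which is an assumption here since it is not
written out in the text.

```python
import math


class RunningRegression:
    """One-pass regression; no data point is stored."""

    def __init__(self):
        self.k = 0          # number of points seen so far
        self.xbar = 0.0     # running mean of x
        self.ybar = 0.0     # running mean of y
        self.a = 0.0        # running sum of (x - xbar)**2
        self.c = 0.0        # running sum of (x - xbar)*(y - ybar)

    def add(self, x, y):
        if self.k == 0:
            self.k, self.xbar, self.ybar = 1, x, y
            return
        k = self.k
        dx = (x - self.xbar) / (k + 1)
        dy = (y - self.ybar) / (k + 1)
        self.a += k * (k + 1) * dx * dx
        self.c += k * (k + 1) * dx * dy   # assumed analogue of the a update
        self.xbar += dx
        self.ybar += dy
        self.k = k + 1

    def slope(self):
        return self.c / self.a

    def intercept(self):
        return self.ybar - self.slope() * self.xbar

    def std_x(self):
        return math.sqrt(self.a / (self.k - 1))


reg = RunningRegression()
reg.add(1365.0, 1.0)
reg.add(1366.0, 2.0)
two_point = (reg.slope(), reg.intercept(), reg.xbar, reg.std_x())
reg.add(1367.0, 3.0)
three_point = (reg.slope(), reg.intercept())
```

The coefficients are read off after two points and again after three,
without restarting the sums, which is the property the text emphasizes.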

There are similar algorithmic problems associated with each of the other
mathematical processes mentioned earlier [1,9,12]. All have been solved, and
experienced numerical analysts are aware of the solutions. But the
computations appear to be so trivial mathematically that computer users are
constantly writing their own software for these tasks completely unaware that
there are problems.

It is somehow ironic that the average user providing his own software is
often no worse off than his colleague who relies naively on the software
provided for him in a library. There appears to be no remedy for the
situation. But microcomputer users are not alone; we are all victims of
ignored history. The real challenge in numerical software was known to Hegel
and Santayana.

Abstract

The increasingly sophisticated applications for microcomputer systems
require an efficient and reliable general-purpose floating-point facility.
This paper focuses attention on numerical behavioral problems that should be
considered both when designing and using floating-point arithmetic on
microprocessors. Topics discussed include tradeoffs amongst floating-point
representations, deviations of behavior from that exhibited by the real
number system, numerical algorithmic difficulties, and testing techniques for
assessing the performance of floating-point computations. A comprehensive
bibliography is also included which serves as a guide to the literature on
floating-point arithmetic and its influences on mathematical software
development.

1.  Introduction 

The increasingly sophisticated applications for present and future
microprocessor systems require an efficient and reliable floating-point
facility with an extensive support library of practical functions that are
easily accessible to users; work in this area is still in an embryonic state
of development and could therefore benefit from the floating-point experience
acquired from efforts on larger machines. This paper focuses attention on
some of the numerical problems involved in selecting and using floating-point
arithmetic on microcomputers. It is hoped that knowledge of floating-point
behavior traits can benefit the microprocessor community by helping its
members to understand and anticipate problems in designing and using
general-purpose computer arithmetic with future systems.


The creation, utilization, and testing of a floating-point capability
involves many factors with a myriad of interactions. Throughout this paper we
shall indicate numerous behavioral features and how to detect and cope with
them. In Section 2, floating-point representations are examined along with
tradeoffs amongst base 2, 8, and 16 arithmetic. The next section deals with
discrepancies between floating-point behavior and that exhibited by the real
number system. The penultimate section is devoted to numerical difficulties
associated with certain algorithmic features. Testing techniques to assess
the reliability of a computer's floating-point facility are discussed in the
final section. The interested reader seeking further information concerning
the subject matter of this paper is encouraged to examine the comprehensive
bibliography in the appendix; it contains references to works on the behavior
of various floating-point arithmetic systems and other computational
environmental influences affecting the development, testing, and use of
reliable mathematical software.

2. Examples of Floating-Point Representations

The internal floating-point representation in most computer systems consists
of the components shown in Figure 1; in large machines, the floating-point
format is contained in a single word whereas several words are usually needed
for a mini- or microcomputer due to the shorter word length. S is the sign bit
and indicates whether the number is positive or negative. The mantissa is
composed of a T-digit fraction expressed in a number base, $\beta$, called the
radix, which is selected for the internal floating-point design; the radix
point is usually assumed to be to the immediate left of the first
$\beta$-ary digit, $d_1$, of the mantissa. The exponent segment value,
denoted by e, is either a signed number or a biased representation, i.e., a
fixed constant is added to the exponent value to eliminate the need for an
explicit sign bit. The base $\beta$ is raised to the power indicated by the
exponent. The internal form of a floating-point number, X, is expressed
mathematically as $X = \pm .d_1 d_2 \ldots d_T \times \beta^e$ where each
$\beta$-ary digit, $d_i$, is an integer between 0 and $\beta - 1$, inclusive.
X can also be defined external to the machine as a base 10 number
$X = \pm .d'_1 d'_2 \ldots d'_{T'} \times 10^f$ where f is an appropriate
exponent of 10 and each $d'_i$ is between 0 and 9, inclusive; T does not
necessarily equal T'. A number exactly representable in base 10 need not
be in another base and vice versa; for example, 1/10 is represented exactly
by .1 in base 10 but has no exact equivalent in base 2. Thus error can be
introduced by computer conversion procedures to input or output
floating-point numbers. Most computers use bases 2, 8, or 16 in their
floating-point models.

Figure 2 lists some attributes of typical floating-point designs on existing
machines.
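
The normalized form described above, and the fact that 1/10 terminates in
base 10 but not in base 2, can be illustrated with exact rational
arithmetic. The sketch below is only an illustration; the function name
normalized_digits is hypothetical, digits beyond the T-th are simply
chopped, and only positive values are handled.

```python
from fractions import Fraction


def normalized_digits(x, beta, t):
    """Return (digits, e, exact) with x ~ .d1 d2 ... dt * beta**e, d1 != 0.

    x must be a positive Fraction. `exact` is True when no nonzero digits
    were discarded, i.e. x is exactly representable with t digits.
    """
    e = 0
    while x >= 1:                       # shift right until x < 1
        x /= beta
        e += 1
    while x < Fraction(1, beta):        # shift left until d1 != 0
        x *= beta
        e -= 1
    digits = []
    for _ in range(t):
        x *= beta
        d = int(x)                      # next base-beta digit
        digits.append(d)
        x -= d
    return digits, e, x == 0


tenth_decimal = normalized_digits(Fraction(1, 10), 10, 3)
tenth_binary = normalized_digits(Fraction(1, 10), 2, 8)
```

In base 2 the digit pattern 1100 repeats indefinitely, so no finite
mantissa length makes the representation of 1/10 exact.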

Operands and resultants of the basic floating-point arithmetic operations
are usually expressed in "normalized" form, i.e., the mantissa is specified as
a fraction with a nonzero first $\beta$-ary digit, $d_1$. An unnormalized
number ($d_1 = 0$) can be normalized by left shifting the base $\beta$
mantissa digits until $d_1 \neq 0$ and appropriately decreasing the exponent
value. When normalizing, it is helpful to have available one or more "guard"
digits so that as the shifting takes place, significant nonzero $\beta$-ary
digits enter into the right end of the mantissa rather than non-significant
zeros; such guard digits participate in an intermediate stage of the
arithmetic operations within the machine's registers and are also used for
rounding purposes. Normalization is performed with respect to $\beta$-ary
digits, $d_i$; thus a normalized base 2 number must have a leading nonzero
mantissa bit whereas a normalized octal or hexadecimal number can have up to
two or three leading zero bits, respectively. This variation in the number of
leading zero bits present in high base models contributes to fluctuations in
numerical performance.

The following factors should be considered when designing and using a
floating-point facility: (1) accuracy of basic arithmetic operations and
all the library routines; (2) number base used for internal format; (3) total
number of bits in the floating-point representation; (4) mantissa versus
exponent lengths; (5) accuracy of input and output conversions; (6) hardware
tradeoffs, such as number base versus floating-point range or speed versus
size of floating-point model; (7) appropriate range of representable numbers
for microcomputer applications; (8) rounding procedures, such as biased or
unbiased rounding versus truncation. Clearly with so many interdependent
influences it is difficult to find a single, ideal format. Both designers and
users will have to collectively assess their priorities in order to narrow down
the choices of candidates for the best arithmetic system to meet their needs.

To  illustrate  some  of  the  interactions  amongst  the  above-mentioned  factors, 
we  will  examine  a few  simple  cases.  Let  us  consider  8-bit  mantissas  in  hexadecimal 
and  binary  as  well  as  6-bit  mantissas  in  octal  and  binary.  For  all  cases  we 
will  assume  normalized  representations  and  exponent  values  of  0,  1,  or  2.  The 
vertical  lines  in  Figures  3,  4,  5,  6,  and  7 depict  the  exactly  representable 
numbers  for  each  base  and  format  in  the  range  0 through  5,  inclusive.  The 
corresponding  negative  regions  are  mirror  images  of  those  positive  regions  shown 
in  the  aforementioned  figures.  In  all  the  examples,  the  same  basic  fractional 
unit is used to measure the distance between adjacent representable
numbers.

Several interesting relationships can be observed in these four models.
Although we are dealing with finite representations of the real (infinite)
number system and therefore must expect gaps between successive representable
numbers, it does seem somewhat surprising that the spacings are not equal,
i.e., the interval between any two consecutive representable numbers varies
in size depending on its location and becomes larger as the distance from 0
increases. This situation is also observed when moving away from zero in a
negative direction. For example, the spacings between the floating-point
representation of 1 (or -1) and its immediate predecessor and successor are
1/256 and 1/16, 1/256 and 1/128, 1/64 and 1/8, and 1/64 and 1/32 for bases
16, 2 (8 bits), 8, and 2 (6 bits), respectively. Only between successive
powers of $\beta$ are the gaps equal between consecutive representations.
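
The spacings around 1 can be checked by enumerating the four toy formats
directly. The sketch below is an illustration only; the function names are
hypothetical, and the formats are taken to be normalized mantissas of 2
hexadecimal digits, 8 bits, 2 octal digits, and 6 bits with exponent values
0, 1, or 2, as described in the text.

```python
from fractions import Fraction


def representable(beta, digits, exponents=(0, 1, 2)):
    """Sorted positive normalized numbers .d1...dT * beta**e, d1 != 0."""
    lo = beta ** (digits - 1)           # smallest normalized mantissa
    hi = beta ** digits                 # one past the largest mantissa
    values = set()
    for e in exponents:
        for m in range(lo, hi):
            values.add(Fraction(m, hi) * beta ** e)
    return sorted(values)


def gaps_around_one(beta, digits):
    """Distances from 1 to its predecessor and to its successor."""
    vals = representable(beta, digits)
    i = vals.index(Fraction(1))
    return vals[i] - vals[i - 1], vals[i + 1] - vals[i]


formats = {"base 16": (16, 2), "base 2 (8 bits)": (2, 8),
           "base 8": (8, 2), "base 2 (6 bits)": (2, 6)}
gaps = {name: gaps_around_one(*fmt) for name, fmt in formats.items()}
```

In every format the gap just above 1 is larger than the gap just below it
by a factor equal to the base, which is why the jump is most abrupt in the
hexadecimal model.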

The  non-uniform  distribution  of  the  representable  numbers  contributes  to 
the  violation  of  several  properties  of  the  real  line  (see  Figure  10).  For 
example,  16  and  ^ are  exactly  representable  in  our  base  16  format  but  their 
sum  is  not  and  would  have  to  be  assigned  a value  of  16  or  17  in  this  case 
(nearest  representable  numbers  to  result).  Similarly,  it  can  be  shown  that 
subtraction, multiplication, and division with exactly representable numbers do
not necessarily produce exact results. In these circumstances, hardware or
software procedures have to be used to select the best representable number to
approximate the exact results; such decisions are usually made with the aid of
guard digits which participate in a rounding scheme. Other discrepancies from
the behavior of the real number system are discussed in the next section.

We observe that our larger number base models (8 and 16) have more members
in the interval [-1, 1] than do the corresponding base 2 models but fewer
members than their binary counterparts as we move further away from zero; this
can be verified by the interval representation counts given in Figures 8 and
9. Also in both the 8-bit and 6-bit cases, the base 2 formats exhausted their
ranges of representable numbers before the higher base systems did. It should
be noted, however, that within the base 2 range, (-4,4), the number of
floating-point representations in the hexadecimal and octal models is 75% and
83%, respectively, of the corresponding number of base 2 representations.
Thus higher number bases provide wider ranges for the same number of bits than
their binary counterparts but offer fewer exact representations outside the
interval [-1, 1].
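
These counts can be reproduced by direct enumeration. The sketch below is an
illustration under the same assumptions as the toy models in the text
(normalized mantissas, exponent values 0, 1, or 2); the function name
count_below is hypothetical, and only positive numbers are counted since the
negative side is a mirror image.

```python
from fractions import Fraction


def count_below(beta, digits, bound, exponents=(0, 1, 2)):
    """Count positive normalized numbers .d1...dT * beta**e below bound."""
    lo = beta ** (digits - 1)
    hi = beta ** digits
    values = set()
    for e in exponents:
        for m in range(lo, hi):
            values.add(Fraction(m, hi) * beta ** e)
    return sum(1 for v in values if v < bound)


hex_ratio = Fraction(count_below(16, 2, 4), count_below(2, 8, 4))
oct_ratio = Fraction(count_below(8, 2, 4), count_below(2, 6, 4))
hex_inside = count_below(16, 2, 1)      # hexadecimal numbers below 1
bin_inside = count_below(2, 8, 1)       # 8-bit binary numbers below 1
```

The two ratios come out to 3/4 and 5/6, in agreement with the 75% and 83%
quoted above, while below 1 the hexadecimal model has nearly twice as many
members as its binary counterpart.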

For  a microprocessor  system  we  would  most  likely  select  a floating-point 
model  with  multiples  of  8-bit  bytes  for  exponent  and  mantissa  components.  One 
might  at  first  consider  choosing  a hexadecimal  format  with  say  three  8-bit 
parcels for the mantissa and one 8-bit segment for sign and exponent along with
chopped  base  16  arithmetic;  this  scheme  provides  a wide  range  of  approximately 
10~^^  and  tends  to  reduce  alignment  and  normalization  shifting  (see  Reference 
I-18). But such a model would have the equivalent of only approximately 7
decimal digits and would suffer from the same representation problem exhibited
by our base 16 example, i.e., fewer exactly representable numbers than a base 2
format  in  regions  outside  [-1,  1].  Furthermore,  the  adoption  of  such  a model 
would  probably  be  unwise  in  view  of  a history  of  numerous  complaints  from 
scientific  users  concerning  the  numerical  behavior  of  such  hexadecimal  machines. 

Selection of a base 16 system per se is not bad; however, existing
implementations have tended to produce a combination of unfortunate effects
that have led to aberrant arithmetic behavior which could have been avoided or
minimized by some modifications. For those microprocessor designers who believe
that the hardware implementation advantages of base 16 arithmetic justify the
adoption of a hexadecimal system, it is strongly suggested that alterations to
previous designs should include the use of a longer mantissa, say 5 or 6 bytes
(which would provide the equivalent of approximately 11 to 14 decimal digits,
respectively), two guard digits, and an unbiased rounding scheme (see
References I-4, I-9). Several computational studies (see References I-1, I-4,
I-9, I-37) seem to indicate that the combination of base 16, short mantissa
size, and truncated arithmetic should definitely be avoided.


To promote a high level of numerical reliability for scientific
computations performed on microcomputers, it seems advisable to provide a
floating-point facility with more than the equivalent of 6 or 7 decimal digits
and at least comparable to pocket calculators using 10 to 13 decimal digits.
Oftentimes scientific calculations involve the use of one or more library
routines (each of which introduces varying amounts of error depending on
argument values) along with program loops in which the total floating-point
error propagation can easily reach levels that seriously erode the
effectiveness of 6 or 7 decimal digits. Such situations could jeopardize
applications which require uniformly reliable results. To produce a final
output with only a few accurate leading digits may require substantially more
digits at intermediate stages of the computation. Further accuracy
enhancement of a floating-point facility can be achieved by selecting an
appropriate rounding scheme; several computational studies (see References
I-1, I-4, I-9) seem to suggest the advantage of using rounding (with two guard
digits) rather than truncation, particularly unbiased rounding (see References
I-4, I-9) referred to as R* mode.

On the basis of previous experience on large machines, it might be prudent
to select a base 2 representation with a large mantissa, say 4 or 5 bytes
(which would provide the equivalent of approximately 10 to 12 decimal digits,
respectively), R* mode, and 2 guard digits. As a matter of fact, a simulation
study (see Reference I-1) indicates that if, in addition, we use an implicit
representation in our binary model (i.e., the leading mantissa bit is implicit
with the restriction that only normalized representations of nonzero numbers
are permitted in the system) then such a floating-point system "is roughly
equivalent to carrying one more decimal place" than a base 16 representation
with truncated arithmetic over essentially the same range. The adoption of
the binary model while increasing accuracy might have the net effect of a
decreased range and possibly slower execution than a base 16 counterpart;
however, the former could be alleviated by use of a longer exponent field and
the latter might be remedied by hardware or architectural changes (or the use
of a faster machine to perform the arithmetic) to improve the floating-point
performance rate when working with multi-byte mantissas. Another alternative
is to provide a relatively fast single precision, floating-point capability
and a slower double precision option which provides both an increase in
exponent range and mantissa size as is done on the Univac 1108; mixed mode
operations would have to be handled carefully to avoid any unnecessary
arithmetic anomalies.

The final form of the floating-point facility will be strongly influenced
by hardware speed, software support, and the restrictions imposed by
microprocessor applications. The selection of a "clean" floating-point model
is not sufficient to insure trouble-free numerical behavior. The hardware and
software (especially the library routines) used to maintain the floating-point
system are capable of introducing their own arithmetic peculiarities and/or
intensifying finite representational problems to produce further deviations
from the behavior exhibited by the real number system. Some of these
difficulties are discussed in the next section.

3.  Floating-Point  Behavior  Problems 

There are many anomalies between floating-point behavior and that exhibited
by the real number system; Figure 10 summarizes several of these situations and
Reference I-17 even defines specific floating-point representations for which a
property of the real number system can fail. Most of the deviations from real
line behavior presumably affect a minority of floating-point numbers in a
particular system; just how many of these "fringe" cases are important for a
particular application is dependent on the number base, mantissa size, and
rounding rules enforced by the floating-point facility. It is probably safe
to assume that one or more of these conditions will occur even in small to
moderate size scientific computations.

The  extent  of  deviant  behavior  is  not  only  depfendent  on  the  attributes 
of  the  floating-point  representation  but  also  on  the  influence  of  high-level 
language  compilers.  This  computational  aspect  should  be  of  concern  to  micro- 
processor users  because  of  the  increasing  availability  of  high-level  languages 
including  BASIC,  FORTRAN,  subsets  of  PL/I,  and  ALGOL.  A numerical  result 
produced  with  object  code  from  one  compiler  may  change  when  used  with  code 
from  another  compiler  or  from  another  level  or  option  with  the  same  compiler. 

This happens because there is usually a one-to-many relationship between the
user's high-level language program and possible machine language versions.
Often, particularly with optimizing compilers, the sequence of floating-point
arithmetic operations is altered from that specified by the programmer in
order to produce object code which will execute rapidly; unfortunately, the
rearrangement of operations frequently assumes that all the rules of the real number
system are valid when applied to floating-point numbers. As indicated by
the comments above, this assumption is not necessarily true. A cautious user
aware of this difficulty may insert many sets of explicit parentheses to prevent
the compiler from reordering the arithmetic sequence; however, some compilers
are so "clever" that even this precaution fails, and the compiler still
defeats the programmer's attempts to avoid numerical difficulties, as in the
attempt to avert subtractive cancellation reported by
Cody (see Reference II-3). Another situation, constructed by Kahan
(see Figure 11 and either Reference II-15 or IV-6), illustrates that some compilers
evaluate explicit constant expressions at translation time
using a different sequence of round-off rules than those employed at
execution time; thus two programs which appear to be equivalent, except that
one uses explicit constants in an arithmetic expression and the other uses
variables which have been previously set equal to the corresponding constants,
may not produce the same numerical results. Even in cases where compilers
cause no numerical difficulty they can introduce significant variation in
execution times among object codes for very similar source programs; for
example, Parlett and Wang (see Reference I-29) present some cases occurring in
linear equation solvers where there is as much as a 50% to 90% variation in
execution times among object codes.
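
The hazard of regrouping can be seen in a few lines. The following is a minimal sketch, assuming IEEE double precision arithmetic (the format used by Python floats on common hardware); the values of a, b, and c are illustrative choices, not taken from the references.

```python
# Floating-point addition is not associative: a compiler that regroups
# the operands as if they were real numbers changes the rounded result.
a, b, c = 1.0e17, -1.0e17, 1.0

as_written = (a + b) + c   # a + b is exactly 0.0, so the sum is c
regrouped = a + (b + c)    # b + c rounds back to -1.0e17, so c is lost

print("as written:", as_written)
print("regrouped :", regrouped)
```

Both expressions are equal over the real numbers, so a translator that assumes real-number rules would treat them as interchangeable.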

Input and output floating-point conversion procedures are another source
of floating-point behavior problems, since numbers exactly representable in one
number base are not necessarily exactly representable in another base. Matula
(see Reference I-11) has developed some results to determine the
minimum size of a computer's floating-point mantissa such that a number
input to the machine can be output as exactly the same number, or deviate
from the original number by at most one unit in the last place. Notice from the table
in Figure 12 that, to maintain the same number of decimal digits for which this
relationship holds, a hexadecimal machine usually requires more bits in its
internal representation than an octal machine, and an octal machine always requires
more bits than a binary computer. One implication
of this relationship is that the number of decimal digits that can serve as
input data to a program while still minimizing conversion error will vary
considerably across existing machines; thus, for example, with a 5 or 6 byte binary
mantissa we could input approximately 12 to 14 decimal digits, whereas with a 3 or
4 byte hexadecimal mantissa we could input only 6 to 8 decimal digits and still
satisfy the above-mentioned assumption.
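
The same kind of digit budget can be checked on a present-day format. This is a sketch only, assuming a 53-bit binary mantissa (IEEE double precision, as used by Python floats) rather than any machine in Figure 12; the function name survives_round_trip is hypothetical. For that format, decimal input of up to 15 significant digits survives conversion in and back out, while 17 digits are needed for a binary value to survive conversion out and back in.

```python
def survives_round_trip(value: float, digits: int) -> bool:
    """Print value with the given number of significant decimal digits,
    read the text back, and report whether the same float is recovered."""
    text = f"{value:.{digits}g}"
    return float(text) == value

x = 0.1 + 0.2   # a binary value with no short decimal representation
results = {digits: survives_round_trip(x, digits) for digits in (15, 16, 17)}
print(results)

# Going the other way: 15 significant decimal digits in, the same 15 out.
decimal_input = "0.123456789012345"
echoed = f"{float(decimal_input):.15g}"
print(echoed == decimal_input)
```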

Some microprocessor designers may believe that an easy way out of the floating-point
behavior dilemma is to perform arithmetic using pocket calculator chips
(see Reference I-55). Even assuming the speed differential, interface, and
display problems can easily be resolved, this can still be an unsatisfactory
alternative at the present time. The main reason is that most pocket calculators
introduce their own serious anomalies (see References I-50, I-51).
At least one manufacturer uses different sizes for the accumulator, stack
elements, and memory registers in such a manner that the order of the operands
can affect the results of a single arithmetic operation, i.e., the
commutative law of addition (A+B = B+A) or multiplication (A*B = B*A) can
be violated on such calculators. Peculiar or inconsistent use of guard
digits  and  a penchant  for  truncation  rather  than  rounding  cause  further 
difficulty.  Additional  behavior  problems  on  many  calculators  are  introduced 
by  the  non-uniform  error  behavior  or  poor  quality  of  the  elementary  functions 
over  the  range  of  representable  arguments.  Thus  at  present,  with  a few 
exceptions  (such  as  the  clean  arithmetic  and  elementary  functions  on  the 
HP67  or  97),  it  would  probably  be  unwise  to  utilize  most  of  the  currently 
available  pocket  calculator  chips  in  a microprocessor  system  if  consistent  and 
highly reliable numerical results have top priority.

4. Algorithmic Implementation Considerations


From the above discussions it should be apparent that, to produce highly
reliable mathematical applications software, the algorithm selection and implementation
phases cannot afford to be oblivious to the computational environment.
Depending on the degree of programmer awareness and expertise, it
is possible to create several implementations, each of which can produce different
numerical results for the same problem on the same machine with the same compiler;
of course, additional variations can often be introduced by utilizing several
translators or by choosing different options with the same one. When selecting
an algorithm the user should attempt to determine its limitations for solving
his problem and how the calculations can best be implemented in a particular
environment to maintain numerical integrity. This means we must be concerned
with such matters as: choosing appropriate termination criteria and error
tolerances; attempting to minimize round-off error propagation; determining how,
when, and where additional precision is needed; and monitoring error and/or other
factors to inform users of the reliability of the computed result. In this
section and the next we describe some actions for dealing with these issues.

Scientific applications programmers should consider employing routines to
automatically determine computational environment parameters which, in turn,
could be utilized in implementing algorithms and in promoting program
transportability to other machines. A simple example of such a routine is that written by
Malcolm (see Reference V-12) and modified by Gentleman and Marovich (see
Reference V-9); this code determines the floating-point number base, mantissa
size, and whether chopping or rounding is used. If this or a similar program
segment is inserted within an algorithm's implementation to determine the computational
setting, then such items as error tolerances, scaling factors, the
appropriate number of digits for program constants, termination criteria, etc.
could be specified in terms of the environmental features. The quality of
scientific applications software could be greatly enhanced by use of arithmetic,
algorithmic, and systems parameters; with such information, a program designer
could provide his own fixups for overflow or underflow conditions and other
situations where the software might fail due to the computational surroundings.
This could considerably ease the burden of adapting programs to new machines.
A discussion of what factors should be considered as basic machine parameters
is given by Cody (see Reference V-5).
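
An environment probe of this kind can be sketched briefly. The code below follows the general idea usually attributed to Malcolm's routine (grow a value until adding 1 is no longer exact, then look at the spacing), but it is not a transcription of the code in References V-12 or V-9; the function name probe_float_format is hypothetical, and the detection of chopping versus rounding is omitted.

```python
import sys

def probe_float_format():
    """Estimate the radix and the number of radix digits in the mantissa
    using only floating-point arithmetic on the host."""
    # Grow a until a + 1 is no longer exactly representable.
    a = 1.0
    while ((a + 1.0) - a) - 1.0 == 0.0:
        a *= 2.0
    # The smallest increment that changes a is the spacing there: the radix.
    b = 1.0
    while (a + b) - a == 0.0:
        b += 1.0
    radix = int((a + b) - a)
    # Count how many radix digits fit before adding 1 is lost.
    digits = 0
    c = 1.0
    while ((c + 1.0) - c) - 1.0 == 0.0:
        digits += 1
        c *= radix
    return radix, digits

radix, digits = probe_float_format()
print("probed  :", radix, digits)
print("declared:", sys.float_info.radix, sys.float_info.mant_dig)
```

A program could derive its error tolerances and termination criteria from the probed values instead of from constants fixed for one machine.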

Additional issues regarding the production of highly reliable and transportable
math software are being considered by IMSL (International Mathematical
and Statistical Libraries, Inc.), NAG (Numerical Algorithms Group), the NATS Project
(National Activity to Test Software), and the PORT Library Project. Microprocessor
applications programmers could benefit from a perusal of the activities in this
area. Section V of the bibliography in the appendix to this paper lists several
references. Also, Reference II-11 provides some examples of portable codes for
several problem areas in numerical analysis.

Because of the eccentricities of floating-point arithmetic, certain precautions
should be taken when implementing an algorithm. For example, when
summing floating-point numbers of the same sign, round-off error can be reduced
by summing in order from the smallest in magnitude to the largest; this can be
particularly helpful in inner product calculations as well as in summing terms
of certain convergent series and can provide noticeable improvement over the
worst case summation using the reverse order of terms. For more general cases
involving the summation of both positive and negative terms, various techniques
have been devised; some of these are discussed in References II-12, II-14, II-20.
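
The effect of summation order for same-sign terms can be illustrated with a deliberately extreme data set. This is a sketch assuming IEEE double precision; the data and the helper running_sum are illustrative, and math.fsum serves only as an accurate yardstick.

```python
import math

# One large term and many tiny ones, all of the same sign.
terms = [1.0] + [1.0e-16] * 10_000

def running_sum(values):
    """Plain left-to-right accumulation, one rounding per addition."""
    total = 0.0
    for v in values:
        total += v
    return total

largest_first = running_sum(sorted(terms, reverse=True))
smallest_first = running_sum(sorted(terms))
reference = math.fsum(terms)   # accurately rounded sum of the same terms

print("error, largest first :", abs(largest_first - reference))
print("error, smallest first:", abs(smallest_first - reference))
```

When the large term comes first, each tiny term is below half a unit in the last place of the running total and is discarded; when the tiny terms are accumulated first, their combined contribution is large enough to register.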


Another very common situation needing attention involves a phenomenon
called subtractive cancellation, i.e., where two numbers of approximately
the same magnitude and the same sign are subtracted from one another (or two such
numbers of opposite sign are added); this
is a potentially hazardous condition in that the relative error of the difference could be
quite large precisely because the difference is small, in which case the resulting error
propagation could have serious repercussions on the accuracy of succeeding
calculations. To avoid or reduce numerical deterioration from this effect,
the algorithm can either be re-written to eliminate such situations or a
higher precision can be invoked at some intermediate stage of the calculation
before the subtractive cancellation occurs. Further examples of specific
algorithmic computational difficulties are described in References II-9, II-15,
and II-32.
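
Rewriting an expression to remove the cancellation can be sketched as follows. The example (1 - cos x for small x, replaced by the identity 2 sin^2(x/2)) is an illustrative choice and not one taken from the references; IEEE double precision is assumed, and the function names are hypothetical.

```python
import math

def one_minus_cos_naive(x):
    """Direct form: cos(x) is close to 1 for small x, so the subtraction cancels."""
    return 1.0 - math.cos(x)

def one_minus_cos_rewritten(x):
    """Same quantity via 1 - cos(x) = 2 sin^2(x/2); no subtraction of close values."""
    s = math.sin(0.5 * x)
    return 2.0 * s * s

x = 1.0e-8
exact = 0.5 * x * x   # leading term of the series; the next term is negligible here
naive_rel_err = abs(one_minus_cos_naive(x) - exact) / exact
rewritten_rel_err = abs(one_minus_cos_rewritten(x) - exact) / exact

print("naive relative error    :", naive_rel_err)
print("rewritten relative error:", rewritten_rel_err)
```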

The  decision  as  to  whether  or  not  to  use  higher  precision  is  problem, 
algorithm,  and  machine  dependent;  the  user  should  be  able  to  determine  if  he 
will  obtain  sufficiently  more  correct  digits  to  justify  the  increase  in  time 
and/or  memory  requirements  involved  in  using  multiple  precision.  Let  us 
examine  a few  possibilities.  Figure  13  illustrates  a situation  in  which  the 
ratio  of  double  precision  to  single  precision  time  for  the  program  segments  used 
for  inner  product  calculations  in  the  solution  of  linear  systems  of  equations  is 
approximately 2.47 for the IBM 370, 6.31 for the CDC 6500, and 1.81 for the
Univac  1106;  note,  however,  that  for  all  three  machines  the  single  precision 
version  is  never  more  than  one  digit  less  accurate  than  the  double  precision 
version.  For  the  IBM  machine,  which  has  the  equivalent  of  only  6 or  7 decimal 
digits  in  single  precision,  the  increase  in  time  to  obtain  an  extra  digit  can 
probably  be  justified.  For  the  Univac  machine,  which  has  the  equivalent  of  7 or 
8 decimal  digits  in  single  precision,  the  relatively  small  increase  in  time 
could  also  be  justified  for  selection  of  the  double  precision  version.  In 
contrast, for the CDC computer, which has the equivalent of 14 or 15 decimal
digits in single precision, the substantial increase in time required for the
double precision segment could probably not be justified in most cases,
especially since only one more correct digit would be obtained. Of course,
a final judgment in this example is not always clear cut; situations can arise
when one additional correct digit is important. The point here is to illustrate
some of the possible tradeoffs. Quality math software could provide the
option  for  the  user  or  the  program  to  decide  when  to  use  more  precision. 

Another example demonstrates a compromise between single and double precision,
namely partial double precision: arithmetic operations on single precision
operands are performed in double precision
and the final result is truncated to single precision. In Figure 14 a simple
differential equation (which can be solved exactly) is solved by a numerical
technique. All the variables in the modified Euler formula are in single precision.
The partial double precision version of this equation is obtained by
replacing the expression for the product of the two terms on the right-hand side
with DBLE(H) * DBLE(F(X_k + H/2, Y_k + (H/2) * F(X_k, Y_k))). The net effect of this
change is that the product is computed in double precision along with the
following addition and then the resultant is chopped to single precision because
the left-hand side, Y_{k+1}, is a single precision variable. Computations on this problem were
performed with decreasing step sizes on an IBM 360/67 (hexadecimal arithmetic
with approximately 6 or 7 decimal digits) and a PDP-10 (binary arithmetic with
approximately 8 decimal digits). We observe that for the single precision version,
error decreases up to and including the N = 9 case but increases slightly on the
IBM machine for N = 10. Note that the PDP-10 single precision result has a
noticeably smaller error for N = 10 than does the IBM 360; however, for the
partial double precision versions, the IBM result for N = 10 is significantly
improved over its single precision counterpart but there is no such substantial
improvement for the PDP-10. This example demonstrates that for a specific
algorithmic implementation on a specific machine for a specific problem it
may be advantageous to use partial double precision (IBM 360/67 in this case)
whereas it may not be worthwhile on another machine (PDP-10 in this case). It
would be very hazardous to generalize from this example since the aforementioned
observations are very problem, algorithm, and machine dependent; however, this
illustration does indicate some options the applications programmer should
consider when designing his software.
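
The idea of partial double precision can be sketched on a simpler computation than the one in Figure 14: a long accumulation of single precision operands. The code assumes IEEE single and double formats, emulates single precision by rounding through a 4-byte representation (the helper to_single is hypothetical), and rounds where the text describes chopping, so it illustrates the technique rather than reproducing the reported experiment.

```python
import struct

def to_single(x):
    """Round a Python float (double) to the nearest IEEE single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

h = to_single(0.1)   # a single precision operand
n = 100_000          # the ideal sum is 10000

# Pure single precision: every partial sum is rounded to single.
single = 0.0
for _ in range(n):
    single = to_single(single + h)

# Partial double precision: partial sums are kept in double and the
# result is reduced to single precision only once, at the end.
wide = 0.0
for _ in range(n):
    wide += h
partial_double = to_single(wide)

print("single        :", single)
print("partial double:", partial_double)
```

Whether the extra cost is worthwhile remains machine and problem dependent, as the text cautions.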

One  of  the  most  insidious  influences  upon  algorithmic  implementations  is 
the  effect  of  the  elementary  library  support  functions  such  as  sin,  cos,  log, 
etc.  The  microprocessor  community  is  rapidly  increasing  its  exposure  to  such 
functions  as  more  and  more  high-level  language  facilities  are  becoming  available. 
Most  users  automatically  presume  that  the  associated  errors  are  negligible  or 
at least uniform over all arguments. Such persons are strongly urged to carefully
read References I-35, I-49, I-50, I-51, II-5, II-9, and II-15. Some
pocket calculator users have already become victims of the oddities exhibited
by some of these functions (see References I-50, I-51). Sometimes the numerical
performance of a user-written program can be degraded by an elementary library
routine which does not even explicitly appear in the program but is employed
by another function which is explicitly present. (This situation also occurs
on pocket calculators.) Another problem is that the range reduction techniques employed
by some of these functions produce very poor results.

The  persons  responsible  for  providing  such  library  routines  should 
thoroughly  examine  their  behavior  rather  than  passively  accept  those  used 
on  large  machines  and  tacitly  assume  that  they  are  satisfactory.  Relatively 
little  has  been  done  to  extensively  check  out  the  numerical  reliability  of  library 
routines currently available as part of a commercial high-level language facility;
some of the implementations are very naive. The serious microprocessor user
who intends to perform scientific computations should consider applying some
of the available testing techniques (mentioned in Section 5) to check out the
numerical behavior of his library routines. A few algorithms for some of the
elementary functions as well as implementation considerations are given in
References II-2, II-5, II-8, II-15, II-17, II-29, II-33.

The  programmer  should  try  to  provide  high  quality,  robust  implementations 
of  his  algorithms;  such  software  should  warn  users  of  potentially  poor  results, 
monitor or estimate error, and be thoroughly tested and well-documented, including
clearly defined program limitations and a performance profile. A
table like that given in Figure 15 would be helpful in indicating to the
user how the error tends to vary as the argument range is changed. If more
than one algorithm is used (usually called a polyalgorithm), the switching
criteria  should  be  examined  to  determine  if  the  borderline  cases  are  being 
handled  properly.  Furthermore,  the  program  designer  should  make  sure  that 
his  thoughtfulness  in  implementation  is  not  sabotaged  by  the  compiler;  for 
example,  Cody  (see  Reference  II-3)  cites  a case  in  which  to  avoid  subtractive 
cancellation,  1-Y  was  re-written  as  (.5-Y)  + .5  but  unfortunately  the  compiler 
treated  this  as  1-Y  when  generating  code. 

The  moral  is  to  know  the  behavior  of  the  floating-point  arithmetic, 
compiler,  elementary  functions  as  well  as  the  algorithm  and  problem  if  you 
want  to  provide  high  quality  mathematical  software.  As  Figure  16  indicates, 
the  sources  of  error  in  problem  solving  are  all  closely  related;  one  false 
move at one stage can cause severe repercussions elsewhere. Cody discusses
the attributes of quality software in Reference I-49. Section II in the
bibliography lists sources of algorithm design experiences.


5. Testing Techniques for Evaluating Numerical Reliability

Several procedures are available for assessing the quality of a floating-point
support facility and mathematical applications software. In this section
we briefly describe some of the approaches and their limitations. Sections III
and IV of the bibliography in the appendix give references to many of the techniques
employed for error monitoring and performance testing.

A common  practice  has  been  to  check  out  most  routines  for  only  a very  small 
sampling  of  input  data;  usually  a few  cases  with  known  results  are  examined  along 
with  some  additional  cases  with  random  arguments.  This  approach,  although  widely 
used, is unsatisfactory in this form for adequately testing math software
reliability because the use of a collection of pseudo-random arguments cannot
ensure that all numerically aberrant behavior for specific data values or
sub-intervals of representable numbers will be discovered. This situation is further
complicated  by  the  influences  of  computerized  floating-point  behavior  which  is 
dependent  on  the  number  base  selection,  mantissa  and  exponent  lengths  as  well  as 
arithmetic  round-off  rules;  thus,  data  sets  which  might  adequately  exercise  the 
range  of  representable  numbers  of  one  computer  might  be  entirely  insufficient 
for  another  machine  which  possesses  different  floating-point  attributes. 

Modifications of the above-mentioned approach have been proposed to improve
the situation. Cody (see Reference IV-5) has suggested that the entire range of
representable numbers should be exercised for the input arguments of a tested
routine by the use of both selected bit patterns and carefully controlled sets
of pseudo-random data; he advocates that the complete floating-point number range
be decomposed into a set of sub-intervals, that a collection of random bit
patterns be tested in each interval, and that the computed output be compared with exact
results, entries from published tables, and/or numerical results from calculations
performed in higher precision than is employed by the tested program.
If the amount of error associated with the computed results of the test routine
varies substantially in any sub-interval, then that interval is further subdivided
and testing continues with pseudo-random arguments in these new intervals
in an attempt to isolate the argument ranges causing numerical problems.
In addition, sub-interval boundary points, selected bit patterns based on the
floating-point representation, and any cross-over and neighboring data points
where the routine changes algorithms are also tested. From this procedure a
table can be generated like the one shown in Figure 15 in which argument range
sub-intervals are given along with the corresponding N-bit errors, maximum
relative error, and the root-mean-square error; such a table can warn the potential
user of argument ranges to avoid as well as indicate the overall program
accuracy. This approach and variations of it (see References IV-16,
IV-17) could be useful in testing library and applications routines for microprocessors.
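
A much reduced version of such a test harness might look like the sketch below. It is not the procedure of Reference IV-5: the routine under test (a deliberately naive exp(x) - 1), the sub-intervals, the sample count, and all function names are assumptions made for illustration, the arguments are pseudo-random floats rather than selected bit patterns, and the higher precision reference is Python's decimal module at 50 digits.

```python
import math
import random
from decimal import Decimal, getcontext

getcontext().prec = 50   # reference arithmetic carries 50 decimal digits

def routine_under_test(x):
    return math.exp(x) - 1.0          # deliberately naive for small x

def reference(x):
    return Decimal(x).exp() - 1       # Decimal(x) converts the float exactly

def profile(lo, hi, samples=200, seed=1):
    """Return (maximum, root-mean-square) relative error on [lo, hi]."""
    rng = random.Random(seed)
    errors = []
    for _ in range(samples):
        x = rng.uniform(lo, hi)
        exact = reference(x)
        err = abs((Decimal(routine_under_test(x)) - exact) / exact)
        errors.append(float(err))
    rms = math.sqrt(sum(e * e for e in errors) / len(errors))
    return max(errors), rms

intervals = [(1e-10, 1e-8), (1e-8, 1e-4), (1e-4, 1e-1), (1e-1, 1.0), (1.0, 10.0)]
table = [(lo, hi) + profile(lo, hi) for lo, hi in intervals]
for lo, hi, max_err, rms_err in table:
    print(f"[{lo:g}, {hi:g}]  max = {max_err:.2e}  rms = {rms_err:.2e}")
```

The resulting table shows the error varying by many orders of magnitude across argument ranges, which is the kind of information a performance profile is meant to expose; a fuller harness would go on to subdivide the worst intervals.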

Of  course,  like  all  other  available  techniques,  it  cannot  guarantee  to 
locate  all  inaccurate  situations;  only  an  exhaustive  testing  of  all  possible 
representable  combinations  of  input  data  values  would  do  that  and  this  is 
usually  considered  to  be  too  costly  and/or  impractical.  The  above-mentioned 
suggested modifications do, however, significantly improve the chances of detecting
problem cases. At the present time there is no automatic generation of
selected bit patterns, thus requiring that such data be hand-generated and
changed from machine to machine because of computer arithmetic dependencies.
Bit patterns are used rather than decimal floating-point numbers in order to
separate  the  transmitted  error  induced  by  the  conversion  of  base  10  input  to 
internal  floating-point  format  from  error  induced  by  exact  input  data.  The 
effects  of  the  conversion  errors  can  be  examined  separately,  if  desired. 

During both the program design and production use phases, it is helpful
to assess numerical reliability with the aid of error-measuring techniques.
The "textbook" type of error analysis is usually not too effective in practical
situations; it tends to be concerned only with the truncation error (or often
a non-optimal bound on this error) associated with the creation of an algorithm
and usually totally ignores computer round-off error propagation introduced by
the implementation, including error from use of the library support routines.
Some typical error analysis approaches can be classified as follows: 1) error-bounding
schemes; 2) forward error analysis; 3) backward error analysis;
4) multiple precision arithmetic. A compilation of references to these techniques
is given in Section III of the bibliography in the appendix.

Error-bounding  methods  produce  bounds  to  the  computed  solution.  The  most 
well-known  of  these  schemes  is  interval  analysis  developed  by  Moore  and  associates 
(see  Reference  III-4).  It  is  based  upon  use  of  a closed  interval  associated  with 
each  input  variable  and  accompanying  arithmetic  rules  for  interval  operations  to 
trace through the calculations associated with a particular routine. The computed
result of a problem in n dimensions is an n-dimensional parallelepiped which
contains the exact solution. For large computations, the widths of the intermediate
intervals can grow rapidly, and the interval operations often require substantial computational
overhead time. Although interval analysis can be time-consuming, it can be
applied to a wide range of problems. Improvements and variations of interval
analysis are given in Reference VI-1 and newer results are in Reference III-5.
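
The mechanics of interval arithmetic can be sketched in a few lines. This is a toy illustration and not the system of Reference III-4: the class Interval and its methods are hypothetical, only addition and multiplication are provided, and outward rounding is imitated by moving each endpoint one representable number outward with math.nextafter (available in Python 3.9 and later), which is more pessimistic than true directed rounding.

```python
import math
from fractions import Fraction

class Interval:
    """Closed interval [lo, hi]; each operation pushes the endpoints outward
    so that the exact real result is always enclosed."""

    def __init__(self, lo, hi=None):
        self.lo = lo
        self.hi = lo if hi is None else hi

    @staticmethod
    def _outward(lo, hi):
        return Interval(math.nextafter(lo, -math.inf), math.nextafter(hi, math.inf))

    def __add__(self, other):
        return Interval._outward(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, other):
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval._outward(min(products), max(products))

    def width(self):
        return self.hi - self.lo

    def contains(self, exact):
        return Fraction(self.lo) <= exact <= Fraction(self.hi)

x = Interval(0.1)                  # degenerate interval holding the float 0.1
total = Interval(0.0)
for _ in range(10):
    total = total + x * x          # every operation widens the enclosure a little
exact = 10 * Fraction(0.1) ** 2    # the same expression in exact rational arithmetic

print(total.lo, total.hi)
print("width:", total.width(), "encloses exact value:", total.contains(exact))
```

Even in this small loop the width grows with every operation, which is the behavior that makes long interval computations expensive or pessimistic.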

A forward  error  analysis  involves  the  calculation  of  a result  and  then  a 
comparison  of  it  to  some  reference  value  known  to  be  exact  and/or  computed  to 
higher  precision  than  that  used  in  the  results  being  evaluated.  Comparisons 
with  published  table  entries  and  multiple  precision  output,  such  as  is  often 
done  in  argument  testing,  are  examples  of  forward  error  analysis.  Cody,  Hammer, 
and others (see References IV-5, IV-9, IV-19) make use of mathematical identities
or a Monte Carlo approach to perform a forward error analysis for parameter
selection and evaluation.


Backward error analysis has been developed primarily by Wilkinson (see
References I-47, I-48). With the application of this approach it is assumed
that the computed result of a routine is the exact solution to some problem
and an attempt is then made to determine how close the problem solved is to
the original problem which was intended to be solved. The presumption is
that, for well-conditioned cases, if both problems are
close then their results will also be close and appropriate bounds can be
determined. This approach has also been utilized by Stoutemyer (see Reference
I-44) with the aid of a symbolic manipulation system to generate symbolic
floating-point round-off error expressions for each stage of the calculation
and to use these expressions to aid in the analysis.
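
The backward viewpoint can be made concrete for a small linear system. The sketch below is an illustration under stated assumptions, not a method from the references: the 2x2 solver, the test data, and the function names are hypothetical, and the measure used is the standard normwise relative backward error, the residual norm divided by (norm of A times norm of the computed solution, plus norm of b), in the infinity norm, with the residual evaluated exactly in rational arithmetic.

```python
from fractions import Fraction

def solve_2x2(a, b):
    """Gaussian elimination with partial pivoting on a 2x2 system, in floating point."""
    (a11, a12), (a21, a22) = a
    b1, b2 = b
    if abs(a21) > abs(a11):                       # pivot on the larger entry
        a11, a12, a21, a22, b1, b2 = a21, a22, a11, a12, b2, b1
    m = a21 / a11
    y = (b2 - m * b1) / (a22 - m * a12)
    x = (b1 - a12 * y) / a11
    return x, y

def backward_error(a, b, sol):
    """Size of the smallest relative change to (A, b) for which the computed
    solution is exact, measured as ||b - A s|| / (||A|| ||s|| + ||b||)."""
    s = [Fraction(v) for v in sol]
    rows = [[Fraction(v) for v in row] for row in a]
    rhs = [Fraction(v) for v in b]
    resid = max(abs(r - sum(c * t for c, t in zip(row, s)))
                for row, r in zip(rows, rhs))
    norm_a = max(sum(abs(c) for c in row) for row in rows)
    norm_s = max(abs(t) for t in s)
    norm_b = max(abs(r) for r in rhs)
    return float(resid / (norm_a * norm_s + norm_b))

A = ((0.1, 0.2), (0.3, 0.7))
rhs = (0.3, 1.0)
solution = solve_2x2(A, rhs)
eta = backward_error(A, rhs, solution)
print("computed solution:", solution)
print("backward error   :", eta)
```

A backward error near the unit round-off says the computed solution is exact for a problem very close to the one posed; how close the solution itself is then depends on the conditioning of the problem.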

Work based on Wilkinson-type backward error analysis has also been used to
develop an approach to automatic stability analysis (see References I-39,
I-40). This technique is concerned with obtaining an algorithm's worst case
round-off error rather than attempting to monitor error associated with each
possible data set. The approach has great potential for economically determining
if a method is numerically unstable. This is accomplished by the use
of FORTRAN programs which define a heuristic search for a specific input data
set that produces large round-off errors for the algorithm being tested. Since
the computerized approach is heuristic, there is no guarantee that, even if one
or more such data sets exist, the program will necessarily find one. The
method has been successfully applied to several numerical linear algebra
methods and can be used in other problem areas with algorithms involving algebraic
processes which satisfy certain limitations.

Multiple  precision  arithmetic  is  another  alternative  for  assessing  numerical 
reliability.  The  main  premise  is  that  increasing  the  precision  under  which  a 
mathematical  routine  is  executed  can  lead  to  extremely  accurate  results  which,  in 
turn, can be used to judge the accuracy of a program's single and double precision
performance. Most multiple precision packages are defined in software
and can therefore be very slow. It has been estimated (see Reference I-5) that
execution time increases linearly with precision for addition and quadratically
for multiplication; time increases can be far worse for extended precision
function evaluations.

There are at least three packages known to this author which attempt to
provide a portable extended precision facility. The system constructed at the
National Bureau of Standards (see Reference III-12) is written in ANSI FORTRAN
and can be used with a precompiler to scan certain super precision data types;
it also includes a FORTRAN extended precision library of standard functions and
permits multiple precision calculations in bases 2 to 16. The Hull and Hofbauer
package (see Reference III-9) was originally written in ALGOL W and runs on
the IBM 360/370 series; a FORTRAN version is currently under development. Brent
(see Reference III-7) has constructed 97 routines written in ANSI FORTRAN; his
package has been run on a PDP-10 and 11, IBM 360/370 series, Univac 1108 and
1110, and on a CDC Cyber 70, model 76. All three packages may be of interest
to the microprocessor community both as a source of ideas as to how to design
extended precision elementary functions and as a tool for checking out the
numerical results of existing or proposed implementations of library and
applications routines.

In addition to numerical reliability and error measures, some other commonly
used performance criteria include central processor time, overhead time, number
and type of arithmetic operations performed, number of function and derivative
evaluations, storage requirements, cost, efficiency, and stability region.
Definitions of cost, efficiency, and reliability can vary considerably. References
IV-11, IV-12, IV-15, IV-20 discuss various performance standards. Informal
collections of test problems have evolved and have been used in several areas;
References IV-7, IV-12, IV-13, IV-15, IV-20 list some available test sets and
others are given in Reference VI-3.

Numerous  computational  case  studies  have  been  performed.  Analysis  of 
their  results  should  include  an  awareness  of  performance  criteria,  test  case 
selection, computer algorithms and implementations investigated, and the computational
environment in which the programs were executed. The methodologies
employed  in  these  studies  should  be  of  interest  to  microprocessor  applications 
programmers.  Some  of  the  more  extensive  studies  include  those  mentioned  in 
References  IV-2,  IV-7,  IV-12,  IV-15,  IV-20  and  others  listed  in  Reference  VI-3. 

It  is  hoped  that  from  the  comments,  examples,  and  discussions  in  this  section 
and  in  the  rest  of  the  paper,  the  reader  has  become  aware  of  and  appreciates 
the  problems  in  designing,  testing,  and  using  a floating-point  facility  with 
its  supporting  hardware  and  software.  In  the  next  few  years,  the  increasing 
use  of  microprocessors  for  such  applications  as  financial  calculations, 
sophisticated  games,  process  control,  and  optimization  should  create  a demand 
for high speed and reliable floating-point arithmetic. If the reader carefully
considers  the  issues  presented  in  this  paper  and  in  the  references  in  the 
accompanying  bibliography,  he  should  be  able  to  benefit  from  past  experiences 
on  large  machines  and  be  well  on  his  way  to  intelligently  designing  and  using 
a high-quality  floating-point  facility. 

Figure 1: Format for Internal Floating-Point Representation

Figure 2: Some Properties of Typical Internal Floating-Point Representations

There exist one or more values of floating-point numbers A, B, and C such
that one or more of the following situations occur:

1. Failure of Associative Law

   i.e. (A+B) + C ≠ A + (B+C)
        (A*B) * C ≠ A * (B*C)

2. Failure of Distributive Law

   i.e. A * (B+C) ≠ (A*B) + (A*C)

3. Failure of Cancellation Law

   i.e. A ≠ 0 and A*B = A*C, yet B ≠ C

4. Failure of Closure Law

   i.e. if A and B are exactly representable floating-point numbers then it is
   possible that the results of A+B, A-B, A*B, or A/B are not exactly
   representable or are undefined

5. Lack of Multiplicative Identity

   i.e. one or more floating-point values of A can exist such that A*1 ≠ A

6. Floating-Point Values of A and B Can Exist Such that

   A * (B/A) ≠ B for A ≠ 0

7. Non-Uniform Distribution of Floating-Point Number Representations

   i.e. if A, B, and C are three successive floating-point numbers
   then it is possible that |B-A| ≠ |C-B|
   (for examples, see Figures 3, 4, 5, 6, 7).

8. Weakening of Inequality Relationships

   A<B does not necessarily imply A+C<B+C but does imply A+C≤B+C
   A<B and C<D does not necessarily imply A+C<B+D but does imply A+C≤B+D
   B<C and A>0 does not necessarily imply A*B<A*C but does imply A*B≤A*C

Figure 10: Some Properties of Floating-Point Arithmetic Which Differ from Those
for Real Arithmetic (from References I-17, I-27)
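
Several of the properties listed in Figure 10 can be observed directly on a present-day machine. The following is a minimal sketch, not part of the original figure, and it assumes IEEE 754 double-precision arithmetic with round-to-nearest (the usual behaviour of Python floats); the particular operand values are illustrative choices only. On such arithmetic A*1 = A always holds, so the multiplicative-identity failure is not shown.

```python
from fractions import Fraction

# Associative law: the two groupings round differently.
assoc_fails = (0.1 + 0.2) + 0.3 != 0.1 + (0.2 + 0.3)

# Cancellation law: A != 0 and A*B == A*C although B != C
# (both products underflow to zero for these operands).
a, b, c = 1e-200, 1e-200, 2e-200
cancel_fails = (a != 0.0) and (a * b == a * c) and (b != c)

# Closure law: 1 and 3 are representable, but 1/3 is not; the stored
# quotient differs from the exact rational value.
closure_fails = Fraction(1.0 / 3.0) != Fraction(1, 3)

# Non-uniform spacing: a step of 2**-52 is visible next to 1.0,
# while a step of 1.0 is lost next to 1e16.
fine_near_one = (1.0 + 2.0**-52) != 1.0
coarse_near_1e16 = (1e16 + 1.0) == 1e16

# Weakened inequality: A < B, yet A + C == B + C (so only <= survives).
A, B, C = 1.0, 1.0 + 2.0**-52, 1e20
ineq_weakened = (A < B) and (A + C == B + C) and (A + C <= B + C)

if __name__ == "__main__":
    for name in ("assoc_fails", "cancel_fails", "closure_fails",
                 "fine_near_one", "coarse_near_1e16", "ineq_weakened"):
        print(name, globals()[name])
```

Each flag is a Boolean, so the same checks can be reused as a small acceptance test when a new floating-point facility is brought up.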

Figure 11: Example of Problem Exhibiting Roundoff Error Propagation and Possible
Inconsistency between Results from Compile-Time and Execution-Time Arithmetic
(from Reference IV-6)

Figure 12: Summary of Minimum Number of Digits Needed in Binary, Octal, and
Hexadecimal Mantissas So That Rounded In-and-Out Conversions Will Identically
Recover Any N-Digit Decimal

Figure 13: Example of Tradeoff between Accuracy and Time*

*from T.J. Aird, "Linear Equation Solvers," IMSL Numerical Computations
Newsletter, Issue #5, October 1973.

Figure 15: Example of Argument Testing Results for a Subroutine to Compute X**Y
(from Reference IV-5)

The advent of microprocessors as building blocks for the realization of computing devices has
revolutionized the computer world. Attempts are being made to design computer systems as a network of
LSI processing elements. This has been motivated by the low cost, speed, flexibility and capabilities
offered by the fast-growing LSI and microprocessor technology. Such networks provide the capability of
tailoring the system to a given application through functional orientation, besides providing high
reliability. However, before such systems become practical there are many design issues, such as
control, to be resolved. This paper is an attempt to focus on these issues to give a more coherent
perspective. Several hardware issues and software issues related to network operation and man-machine
interface are discussed.


INTRODUCTION 


The advent of microprocessors as building blocks for the realization of computing devices has
revolutionized the computer world. Earlier generation microprocessors were designed with specific
system characteristics in mind and were mostly I/O oriented. However, as the technology evolved, they
ceased to be the simple microprocessors of the past. A typical microprocessor, nowadays, has a powerful
and flexible instruction repertoire, has a lower price tag and executes its instructions faster.
Because of their speed, flexibility and powerful instruction repertoire they are becoming the building
blocks for future computer systems. These characteristics, coupled with the advances in solid state
memory technology, have made it possible to realize a variety of computer organizations as a network of
microprocessors. Such networks also provide the capability of tailoring the system to a given
application, such as process control, through functional orientation. As microprocessor technology
progresses, attempts are also being made to design computer systems as a distributed network of
microprocessors.

EVOLUTION  OF  MICROPROCESSORS 

Historically, Intel Corporation introduced the first microprocessor into the market. The first
generation microprocessors (1970-73) were 4 bit wide LSI processors designed using p-channel MOS
technology (e.g. Intel 4004). The first generation microcomputer systems based on these microprocessors
were originally intended for implementation as intelligent terminals. Later they were extended to 8 bit
wide systems, with 8 bit data/address I/O buses interfacing the chips with external memories. The second
generation microprocessors emerged (1972) with wider word lengths and n-channel technology (e.g. Intel
8080, M6800, etc.). The systems based on these microprocessors could provide higher throughput because
of their longer word length for both addressing and instructions. These 8 bit microprocessor
architectures are very similar to minicomputers, providing direct memory access channel capacity for
faster input-output data transfers.

Then came the third generation microprocessors that are based on bipolar and SOS technology.
Architecturally these microprocessors' data widths range from 4 to 16 bits. Figure 1 is a simplified
block diagram of a general microprocessor along with its memory and I/O facility. The microprocessor can
communicate with the external world via this I/O facility. In addition the processor might also be able
to communicate with the external world if the memory is shared by others. It is through these two
facilities that a microprocessor can communicate with other microprocessors, making it feasible to
interconnect several microprocessors.

*Research supported by National Science Foundation Grant MCS72-03734-A02.

Currently there are several microprocessors available in the market with varying word lengths,
architectures and speeds. Tables 1a, b, c, d, e show currently available microprocessors. The trends
indicate that a new generation of microprocessors with larger word lengths, higher speeds, and extensive
and powerful instructions will be emerging that could encompass a wide spectrum of applications rather
than be limited to terminal oriented applications. In addition, microprogrammability of these upcoming
microprocessors would remove the problems associated with software, thus making the use of
microprocessors limited only by the imagination of the designers. One possible way of extending the use
of microprocessors to applications that are so far limited by the availability of large scale computers
is to connect them as a network of processors.

DEVELOPMENT  OF  MICROPROCESSOR  BASED  COMPUTER  SYSTEMS 

Basically a microprocessor based network consists of a set of microprocessors connected together in a
specified manner as defined by the architectural requirements. These networks might be built with the
currently available microprocessors or with the LSI processors designed to perform certain specific

Table 1a. 4-bit processors.

No.  Type          Technology  Address   No. of  Manufacturer
                               Capacity  Chips
1    4004, P4004   P MOS       4K        1       Intel (prog log)
2    4040, P4040   P MOS       4K        1       Intel (National)
3    PPS-4         P MOS       4K        1       Rockwell (National)
4    PPS 4/2       P MOS       8K        2       Rockwell
5    PPS 4/1       P MOS       --        1       Rockwell
6    TMS-1000      P MOS       8K        1       Texas Instruments
7    SX 200        P MOS       1K        1       Essex
8    TDY-52A       P MOS       --        1       Teledyne Corp.
9    IMP-4         P MOS       4K        --      National
10   TCS           P MOS       8K        1       National
11   µPD751D       N MOS       4K        1       NEC
12   HD35404       --          4K        1       Hitachi
13   T3271         P MOS       4K        1       Toshiba


Table 1b. 8-bit processors.

No.  Type              Technology       Address   No. of  Manufacturer
                                        Capacity  Chips
1    8008              P MOS            65K       1       Intel (MIL)
2    8080              N MOS            65K       1       Intel (AMD, TI, NEC, Siemens)
3    6502              N MOS            65K               MosTech (AMS)
4    5065              P MOS            32K               MosTech
5    6800              N MOS            65K       1       Motorola (AMI)
6    F-8               N MOS            65K               Fairchild (MosTech)
7    E-A 9002          N MOS            65K               Electronic Arrays
8    SCAMP             P MOS            65K               National (Rockwell, Western Digital)
9    1801              C MOS            65K               RCA
10   1802              C MOS            65K               RCA
11   COSMAC            C MOS            65K       2       RCA
12   ATMAC             C MOS            65K       1       RCA
13   PPS-8             P MOS            32K       2       Rockwell
14   PPS 8/2           P MOS            32K       2       Rockwell
15   2650              N MOS            32K       1       Signetics (AMS)
16   SMS 300           TTL-S (bipolar)  8K        --      Scientific Microsystems (Signetics)
17   Z-80              N MOS            65K       1       Zilog
18   IMP-8             P MOS            65K       --      National
19   CMP-8             N MOS            65K       --      National
20   LP 8000           P MOS            65K       --      General Instruments
21   µCom 8            --               --        --      NEC
22   Burroughs Mini-D  P MOS            --        256     Burroughs
23   µPD753D           N MOS            64K       1       NEC
24   M58710S           N MOS            64K       1       Mitsubishi
25   HD36401,2         P MOS            64K       2       Hitachi
26   MB8861            --               64Kx8     1       Fujitsu

Table 1c. 12-bit processors.

No.  Type             Technology  Address   No. of  Manufacturer
                                  Capacity  Chips
1    6100             C MOS       4K        1       Intersil (Harris)
2    T3153, TLCS-12   P MOS       4K        1       Toshiba
3    T3190, TLCS-12A              96K       1       Toshiba


Table 1d. 16-bit processors.

No.  Type               Technology                  Address   No. of  Manufacturer
                                                    Capacity  Chips
1    CP 1600            N MOS                       65K               General Instruments (EM & n)
2    MCP-1600           N MOS                       65K       3       Western Digital (National)
3    IMP-16             P MOS (bipolar to appear)   65K       1. 2    National
4    PACE               P MOS                       65K               National (Rockwell, Western Digital)
5    PFL-1600A          N MOS                       128K      3       Panafacom
6    TMS 9900, SBP9900  N MOS, I²L                  65K       1       Texas Instruments
7    MicroPro-16        --                          --        --      Plessey
8    µPD755D, µPD756D   N MOS                       --        2       NEC
9    MN1610             N MOS                       64Kx16    1       Matsushita
10   T3412              N MOS                       64K       1       Toshiba
11   --                 SOS                         --        1       ETL



Table 1e. Bit sliced microprocessors.

No.  Type   Technology  Address   No. of      Manufacturer
                        Capacity  bits/slice
1    2901   Schottky              4

functions. In order to exploit these advances in LSI and microprocessor technology to meet the ever
increasing demand for higher throughput and availability, several studies have been made.(1,3,5,6)
These studies are based on one of two general classes of systems: the conventional multiprocessor
architectures and the recently emerging distributed intelligence systems. However, the success and
usefulness of these networks depend on effective means of interconnecting the microprocessors.

Such interconnection mechanisms must provide a means for efficient intercommunication and control. In
addition, there are problems associated with partitioning a job into a set of tasks and scheduling
them. Problems also arise in programming the jobs on a microprocessor network and in providing an
interface that makes the network organization transparent to the user. The relative importance of these
design issues, such as interconnection, depends on the intended goals of the system. For example, if the
microprocessor network is intended for a general purpose computing environment, a flexible but efficient
interprocessor communication mechanism plays a very important role. On the other hand, in a
microprocessor network designed to meet specialized application requirements the communication
requirements are more predictable and the design of the communication mechanisms is relatively easy.
These issues related to system architecture, network control, operating systems, language support and
development tools are discussed in the following sections.


SYSTEM ORGANIZATION


There are many different approaches to organizing computer systems as a network of microprocessors. One
approach is to build microprocessor networks following Flynn's classification, which is based on
instruction and data streams. This classification divides systems into four categories, viz. SISD
(Single Instruction Stream Single Data Stream), MISD (Multiple Instruction Stream Single Data Stream),
SIMD (Single Instruction Stream Multiple Data Stream) and MIMD (Multiple Instruction Stream Multiple
Data Stream). The SISD organization is characterized by a single instruction stream as in any
uniprocessor system such as the IBM 360 or CDC 7600. However, following these familiar large scale
computer organizations a microprocessor network may be built in which each microprocessor performs a
specific function. Figure 2 shows one such possible organization. In this the bipolar LSI
microprocessors are configured as a set of microprocessors working in parallel. The peripheral
processors and memory modules can be multiplexed through fast bus switches. The system control itself
can be implemented by one or more microprocessors. However, such an organization of a microprocessor
network requires very complex software, and the overhead incurred in scheduling the tasks and in
intermodule communication is high.


The second category, MISD, is characterized by multiple instructions operating on a single data stream
as in a pipeline processor such as the CDC STAR. In this type of organization each of the
microprocessors can perform the function of a stage in a pipelined system. This organization can be
further extended to accommodate multiple instruction streams. For example, one or more microprocessors
can be organized as instruction fetching units, another set for instruction decoding and operand
fetching and yet another set for execution. These sets of microprocessors when interconnected properly
could execute multiple instruction streams simultaneously. Successful operation of this scheme requires
schemes for tagging the instructions as they proceed along the pipe. This scheme offers high
reliability because of the presence of multiple functional units. However, the main difficulty of this
scheme lies with the efficient interconnection and instruction tagging schemes.
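
The fetch, decode and execute arrangement with instruction tagging described above can be sketched as a small simulation. This is an illustration under stated assumptions, not the design of any particular machine: the three single-entry stage latches, the round-robin interleaving of streams, the `(stream id, sequence number)` tag format and the two-operation instruction set are all hypothetical.

```python
from collections import deque

# Hypothetical two-operation instruction set used only for this sketch.
OPS = {"add": lambda x, y: x + y, "mul": lambda x, y: x * y}


def run_pipeline(streams):
    """Run several instruction streams through one 3-stage pipe.

    streams maps a stream id to a list of (op, x, y) instructions.
    Returns (results, cycles); results maps each stream id to a list of
    (sequence number, value) pairs recovered from the tags.
    """
    # Interleave the streams round-robin and tag every instruction so
    # its owner can be identified at the end of the pipe.
    queue = deque()
    longest = max((len(s) for s in streams.values()), default=0)
    for seq in range(longest):
        for sid, instrs in streams.items():
            if seq < len(instrs):
                queue.append(((sid, seq), instrs[seq]))

    fetched = decoded = None            # latches between the stages
    results = {sid: [] for sid in streams}
    cycles = 0
    while queue or fetched or decoded:
        # Execute stage: consume the decoded instruction, route by tag.
        if decoded is not None:
            (sid, seq), fn, x, y = decoded
            results[sid].append((seq, fn(x, y)))
        # Decode stage: look up the operation for the fetched instruction.
        if fetched is not None:
            tag, (op, x, y) = fetched
            decoded = (tag, OPS[op], x, y)
        else:
            decoded = None
        # Fetch stage: take the next tagged instruction, if any.
        fetched = queue.popleft() if queue else None
        cycles += 1
    return results, cycles


if __name__ == "__main__":
    demo = {"A": [("add", 1, 2), ("mul", 3, 4)], "B": [("mul", 2, 5)]}
    print(run_pipeline(demo))
```

With three instructions the pipe drains in five cycles (two cycles to fill, then one result per cycle), and the tags allow the interleaved results to be returned to the correct stream.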


The third category, SIMD, is characterized by a single instruction operating on multiple data sets as
in an array processor such as the Illiac IV. This implies that a single set of basic control sequences
is applied across a number of processing units, each of which is associated with a data sequence. All
processing units must act in a synchronous manner. A microprocessor network can be built based on this
type of organization where the application environment requires the same computation performed on all
data sets and there exists no unknown global time relationship between data sets. A slight variation of
this organization is a federated system or a master/slave configuration as shown in Figure 3. A
federated system consists of several processors each of which is dedicated to a particular task, and
the tasks are performed in a multiprogrammed mode of operation. In a master/slave configuration several
microprocessors could be connected as slaves to a master computer, which is usually a large scale
computer. In these types of organizations the microprocessors could be identical or they can be
customized to perform a specific task. These types of networks find their use in applications such as
traffic control, process control, etc. The advantages of such systems are simplicity and reliability.

However, the design of a SIMD type of organization as a network of microprocessors is in general
constrained by the nature and time relationships of its data streams. In addition, this organization
could be effective only in special applications in which the designer can construct a functional analog
of a physical system. The typical applications include resource allocation, hardware executions,
protection mechanisms, communication multiplexing, data compression, sorting, document retrieval, data
file searching, etc.

The last category, MIMD, is characterized by multiple instructions operating on multiple data streams.
This category is analogous to typical multiprocessor systems. However, the limited speed and instruction
capabilities restrict the design of microprocessor networks based on this organization.
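
The four categories discussed above depend only on whether the instruction stream and the data stream are single or multiple. A minimal sketch of that rule follows; the function name `flynn_class` and its integer arguments are illustrative assumptions, not notation from the text.

```python
def flynn_class(instruction_streams: int, data_streams: int) -> str:
    """Return the category name (SISD, MISD, SIMD or MIMD) for a system
    with the given numbers of instruction and data streams."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("a system needs at least one stream of each kind")
    i = "S" if instruction_streams == 1 else "M"
    d = "S" if data_streams == 1 else "M"
    return f"{i}I{d}D"


if __name__ == "__main__":
    print(flynn_class(1, 1))    # uniprocessor
    print(flynn_class(3, 1))    # pipeline of stages on one data stream
    print(flynn_class(1, 64))   # array processor
    print(flynn_class(4, 4))    # multiprocessor
```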

The second approach to exploiting the advances in LSI technology is to partition the processing logic
and to map the partitions onto several LSI processors. Partitioning is the process of selecting portions
of the system logic capable of being implemented on a single LSI chip without violating design
constraints such as the number of I/O pins and number of circuits on each chip. The idea of such
partitioning is to maximize the use of LSI circuits and the gate to pin ratio and to minimize the total
number of circuits required. Logically this approach is a mapping of each of the printed circuit boards
of an SSI/MSI design onto a single LSI circuit. An example is the design of the Amdahl V/6 machine CPU.
Such partitioning and packaging had the advantage of higher speed and less non-logic hardware. But there
are certain disadvantages associated with this approach, such as the high design cost of producing low
volume LSI circuits and the limited number of pins available for communicating with the external world.
However, the initial design costs can be reduced by using gate array approaches in the fabrication
phase,(3) which use two levels of metalization. In this approach a fixed set of gates is always used
while different interconnections are provided at the second level.


The third approach is to design computer systems as a network of functionally oriented LSI processors.
These LSI processors are themselves components of a larger processor, each having an ALU and a small
amount of memory. These processors perform various functions such as arithmetic operations, address
translation, bus allocation, protection, etc. In addition, microprocessors can be used as dedicated I/O
processors supporting the main processors. The I/O processor can itself be a network of
microprocessors, each one performing functions such as code and format conversions, buffer and queue
management, device address translation, device handling, etc. These computer systems can be further
interconnected to realize multiple processor systems to achieve higher performance, as shown in Figure 4.
However, the challenging problem in this type of organization is to find effective means of decomposing
the system functions and mapping them onto the microprocessors while minimizing the communication
overhead.


Fig. 4. Microprocessor network as a multiple processor system.

The effective use of microprocessors in any one of these organizations depends on other hardware issues
such as the type of interconnection or bus organization, memory organization, timing and control, and
addressing. These issues are further discussed in the following sections.

BUS ORGANIZATION

The interconnection scheme between the elements of any computer system is one of the most crucial
factors that facilitates the transfer of data and control. This factor is more prominent in multiple
processor systems. A general configuration of a multiple microprocessor system is shown in Figure 5.
However, there are many different ways in which the system bus may be organized, and this organization
may be classified into 6 classes (Figure 6): i) time shared or common bus organization, ii) fully
interconnected, iii) star, iv) crossbar switch, v) loop or ring and vi) a tree or a combination of
these. In a microprocessor network based on the common bus organization the microprocessors communicate
with each other and with memory and I/O through a time shared bus as shown in Figure 6a. This type of
interconnection is straightforward and easy to implement. However, as the number of processors
increases, the bus might become the bottleneck and multiple buses may be needed. In a fully
interconnected organization each microprocessor is directly connected to every other microprocessor in
the network. This scheme offers the advantage of low message transfer delays and higher reliability, but
tends to become complex as the number of processors increases. Secondly, the communication between
processors may lead to deadlocks. In a star configuration the microprocessors are connected through a
switching center through which all the processors communicate. With this type of organization it is
possible to optimize the interprocessor communication. But as the number of processors increases the
switching node might become a bottleneck, and any failure in this node may lead to total system failure.
In a mesh configuration the processors are arranged along the edges of a crossbar switch as in
C.mmp.(22) This scheme offers faster response time and higher connectivity, but the complexity is
proportional to the square of the number of processors. In a ring configuration the processors are
connected by means of a loop. These loops may in turn be connected in a hierarchical fashion as in
CM*.(8) This scheme offers the advantages of simplicity and flexibility. The CM* is a
multimicroprocessor network being built at Carnegie-Mellon University to investigate the problems of
building and programming a system with large numbers of microprocessors (Figure 7). Finally, in a tree
structure the processing elements are connected in the form of a tree, and this type of organization is
most suitable in applications where the functions can be structured in a hierarchical fashion.

However, the choice of a scheme or a combination of schemes is dictated by the requirements of
bandwidth, reliability, flexibility, cost, number of I/O pins, control complexity and finally the
application requirements.
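
The growth in interconnection complexity noted for the six classes can be made concrete by counting connections. The sketch below uses deliberately simplified assumptions that are not taken from the text: one shared path for the common bus, one link per processor to the switching center of a star, an n-by-n array of crosspoints for the crossbar, one link per processor in a ring, and n-1 links in a tree. The function and topology names are hypothetical.

```python
def connection_count(topology: str, n: int) -> int:
    """Approximate number of links (or crosspoints) needed to connect
    n processors under a simplified model of each bus organization."""
    if n < 1:
        raise ValueError("need at least one processor")
    counts = {
        "common_bus": 1,                       # one time shared bus
        "fully_interconnected": n * (n - 1) // 2,  # a link per pair
        "star": n,                             # one link to the center
        "crossbar": n * n,                     # n-by-n crosspoints
        "ring": n if n > 1 else 0,             # closed loop
        "tree": n - 1,                         # minimal connected graph
    }
    if topology not in counts:
        raise ValueError(f"unknown topology: {topology}")
    return counts[topology]


if __name__ == "__main__":
    for name in ("common_bus", "fully_interconnected", "star",
                 "crossbar", "ring", "tree"):
        print(f"{name:22s}", [connection_count(name, n) for n in (4, 8, 16)])
```

Doubling the number of processors roughly quadruples the count for the fully interconnected and crossbar cases, while the others grow linearly or not at all, which is the trade-off against bandwidth and reliability described above.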


The next major issue concerning microprocessor based systems is the distribution of timing and control,
which is mainly dependent upon the type of system organization adopted. The activities in a
microprocessor network have to be synchronized at two different levels, the local level and the global
level. Local level synchronization is required to control the functions which are local to a particular
microprocessor. Global level synchronization deals with the coordination of the activities of the
individual microprocessors and is concerned with the control of information flow between processors.
These two levels of synchronization require multiple clocks and cause a serious implementation problem
because of skewing and reliability requirements.


The  control  of  the  microprocessors  may  be  centralized  or  distributed  and  the  type  of  control  is  largely 
dictated  by  the  system  organization,  nature  of  application  and  the  reliability  and  real-time  requirements. 
For example, in a master/slave configuration all the microprocessors may be controlled by a single master.

The addressing technique for designating the location of data and instructions is one of the crucial
factors for the success of a microprocessor network. Because of the short word lengths, narrow internal
buses and low speeds of microprocessors, the conventional addressing schemes often used in large scale
computers are inefficient and costly. These schemes include indirect addressing, indexing, etc. In
indirect addressing the address of the data address, rather than the data address itself, is included
with the instruction. In indexed addressing the contents of an index register are added to the address
explicitly included with the instruction. In addition there are several other addressing schemes, such
as direct and stack addressing, that are more suitable for microprocessors. However, the need for a
larger address space in a multimicroprocessor based system requires the development of sophisticated
schemes. One such scheme is the centralized address mapping scheme used in CM*. The CM* system consists
of a set of clusters of computer modules. Each computer module consists of a processor, P, a local
memory, Mp, a number of ports, K maps which allow interconnection to other CMs and an intra-CM switch S.
Each cluster consists of several of these modules and the clusters themselves are connected via inter-CM
buses as shown in Figure 8a. In this system the address mapping for several modules (microprocessors) is
done by a shared mapping unit called the K map. The K map has several features to perform address
mapping and routing of requests between modules connected to different K map units. The virtual address
space of this system is divided into up to 2^16 segments, which are defined by segment descriptors that
specify the physical base address and the length of each segment. Capabilities are also associated with
these segments to define protection. The K maps are capable of routing single word memory access
requests independent of the processors connected. The scheme also permits the CMs to communicate in one
to many and many to one address mappings as illustrated in Figures 8b and 8c. However, this scheme is
efficient in special purpose applications where a process can be bound to a processor and which involve
little or no multiprogramming.
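
The addressing schemes named above, and the use of a segment descriptor holding a base address and a length, can be sketched as follows. This is an illustrative model only: memory is a Python list of words, and the names `direct`, `indirect`, `indexed`, `SegmentDescriptor` and `translate` are hypothetical. It does not reproduce the actual K map hardware or its capability checks.

```python
from dataclasses import dataclass


def direct(memory, addr):
    """Direct addressing: the instruction carries the data address."""
    return memory[addr]


def indirect(memory, addr):
    """Indirect addressing: the instruction carries the address of the
    data address."""
    return memory[memory[addr]]


def indexed(memory, addr, index_register):
    """Indexed addressing: the index register contents are added to the
    address carried by the instruction."""
    return memory[addr + index_register]


@dataclass
class SegmentDescriptor:
    base: int     # physical base address of the segment
    length: int   # number of addressable words in the segment


def translate(descriptors, segment, offset):
    """Map a (segment, offset) virtual address to a physical address,
    rejecting offsets that fall outside the segment."""
    desc = descriptors[segment]
    if not 0 <= offset < desc.length:
        raise IndexError("offset outside segment bounds")
    return desc.base + offset


if __name__ == "__main__":
    mem = [0] * 16
    mem[3], mem[5], mem[7] = 7, 99, 42
    print(direct(mem, 3), indirect(mem, 3), indexed(mem, 3, 2))
    table = {1: SegmentDescriptor(base=0x1000, length=0x100)}
    print(hex(translate(table, 1, 0x10)))
```

Keeping the descriptor table in one shared unit, rather than in each processor, is what makes the mapping centralized in the sense used above.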

MICROPROCESSOR NETWORK SOFTWARE


Microprocessor network software may be classified into the software for the control of the network
operation and the software for the man-machine interface. Network operation software is related to the
network operating systems and deals with issues such as interprocessor communication and task
partitioning. The man-machine interface, as the name implies, refers to the means of communication
between the user and the system. The user's interface with the system is through various languages, at
different levels of abstraction, from bit level to assembly level to high level. The microprocessor
network user/designer interacts with the system for the purpose of system design, development and
maintenance through an assortment of software tools such as compilers, simulators, debuggers, etc. These
tools provide a suitable environment for the development of production software, the design of new
systems and the maintenance of existing ones. In the following sections these software issues in a
microprocessor network environment are examined.

TASK  PARTITIONING 

To exploit the potential advantages of low-cost LSI technology, the application problem
must be decomposed into parallel, cooperating processes. The question of whether the
architecture is designed to support the application or the application is designed to suit the architecture is
simply the issue of special purpose versus general purpose systems. The decomposition process, or
partitioning, establishes the matching between the architecture and the application. Partitioning of an
application refers to the division of the job into a disjoint set of processes or tasks which, when integrated
logically, represent the original job. Partitioning techniques similar to the multitasking facilities
of larger computer systems could be used. However, the limited capabilities of microprocessors restrict the
direct use of such techniques. These limitations arise from the limited memory and addressing
capabilities of the microprocessors, the high speed and critical time response requirements of the
tasks, and the communication overhead. Because of these limitations it is economical to design
application-oriented distributed microprocessor architectures. Such systems permit the use of firmware-centered
design, which enables the development of systematic task partitioning.
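
The definition of partitioning given above (a disjoint set of tasks which together reproduce the original
job) can be stated as a small check. The following Python sketch is illustrative only: the job is modelled
as a set of named operations, and the names `is_valid_partition`, `job` and the task contents are
assumptions made for the example, not part of any system described here.

```python
def is_valid_partition(job, tasks):
    """Return True if `tasks` are pairwise disjoint and together cover `job`."""
    seen = set()
    for task in tasks:
        if not task:              # an empty task does no work
            return False
        if seen & task:           # overlap: some operation is assigned twice
            return False
        seen |= task
    return seen == set(job)       # integrated tasks must equal the original job


# Hypothetical job made of five operations, split across three processors.
job = {"sample", "filter", "scale", "log", "display"}
tasks = [{"sample", "filter"}, {"scale"}, {"log", "display"}]
overlapping = [{"sample", "filter"}, {"filter", "scale"}, {"log", "display"}]
```

A partitioning that assigns an operation to two tasks, or leaves one out, fails the check; how the tasks
are then mapped to processors is a separate question that this sketch does not address.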

INTERPROCESSOR COMMUNICATION

Interprocessor communication deals with the exchange of information between the various microprocessors
in the network. The manner in which communication takes place between the microprocessors is
primarily dictated by the system organization. However, because of the simplicity of microprocessors, any
technique for interprocessor communication in a microprocessor network should be simple and uniform
throughout the network. Normally interprocessor communication is achieved through standard protocols.
A protocol is a predefined set of rules that controls the communication between two or more entities
such as processors. The protocol defines the status information to be exchanged and maintains
coordination between the various asynchronously operating processes.

The protocols in a multimicroprocessor network can be broadly divided into two classes: 1) centralized
protocols and 2) distributed protocols. In a centralized system the communicating processors are local
to one another and often have a common medium of communication and control, such as a central memory. Hence
the centralized protocol often takes the form of a set of rules governing the addressing of shared memory
and access to shared information. In distributed protocols, on the other hand, all the processes are
considered remote to one another; they do not deal with shared tables and memory but with explicit
messages between any two communicating processes. Because of this distributed nature and the lack of any single
control over the entire environment, the problems of addressing, routing, flow control and error
control become extremely important. The choice between centralized and distributed protocols is
largely dictated by the size and nature of the application, the transmission rates, delays,
reliability and flexibility requirements.
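
The contrast between the two classes of protocol may be easier to see in a small model. The Python sketch
below is an illustration under stated assumptions, not a protocol taken from the literature: `SharedMemory`
stands for the centralized case (one common table and one access rule), while `make_message`, `accept` and
`Node` stand for the distributed case (explicit messages carrying addressing, a sequence number and a CRC
for error control, with a bounded inbox as a very crude form of flow control). All names and the message
layout are hypothetical.

```python
import threading
import zlib
from collections import deque


class SharedMemory:
    """Centralized style: processors read and write one common table
    under a single access rule (here, a mutual-exclusion lock)."""

    def __init__(self):
        self._table = {}
        self._lock = threading.Lock()

    def write(self, key, value):
        with self._lock:
            self._table[key] = value

    def read(self, key):
        with self._lock:
            return self._table.get(key)


def make_message(src, dst, seq, payload):
    """Distributed style: an explicit message with addressing (src, dst),
    a sequence number for ordering, and a CRC for error control."""
    body = f"{src}|{dst}|{seq}|{payload}".encode()
    return {"src": src, "dst": dst, "seq": seq,
            "payload": payload, "crc": zlib.crc32(body)}


def accept(message):
    """Receiver-side check: recompute the CRC and reject damaged messages."""
    body = "{src}|{dst}|{seq}|{payload}".format(**message).encode()
    return zlib.crc32(body) == message["crc"]


class Node:
    """A processor with a bounded inbox; a full inbox refuses new messages."""

    def __init__(self, name, capacity=2):
        self.name = name
        self.inbox = deque()
        self.capacity = capacity

    def deliver(self, message):
        if message["dst"] != self.name or not accept(message):
            return False          # wrong address or failed error check
        if len(self.inbox) >= self.capacity:
            return False          # flow control: sender must retry later
        self.inbox.append(message)
        return True


memory = SharedMemory()
memory.write("status", "ready")

node_b = Node("B")
good = make_message("A", "B", 0, "start")
damaged = dict(good, payload="stop")   # payload altered in transit
```

In the centralized model the only rule is who may touch the table and when; in the distributed model every
concern named above (addressing, flow control, error control) has to be carried by the message or enforced
by the receiver. Routing is omitted from the sketch.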

LANGUAGE  ISSUES 

Microcomputer languages fall into the same classes as minicomputer languages: machine, assembly and
high-level. However, the relative frequency of use, the number of languages in each class and the conditions
under which the languages are used are quite different. For example, a great deal of application
work is being done directly in machine language. This is mainly because of the "primitive" nature of the
atmosphere in which much of the work is being done. The configurations of most microcomputer systems are
restricted to "bread board" systems that lack the memory and input/output capacity to support program
development in assembly language. Many of the programs for such systems are written in octal or
hexadecimal, directly onto PROMs, using inexpensive PROM programming equipment.

Recent trends in microprocessor software have been toward the development of tools which make programming
these systems easier and less error prone. For example, perfectly useful assembly language programming
tools are available for the most popular microcomputers. These include text editing programs similar to
those currently available on minicomputers, assemblers, loaders and on-line debugging aids that make it
possible to set up microcomputer program development environments that are comparable to minicomputer
assembly language development environments.

Progress is being made in the area of system-level and high-level language development, but the
repertoire of these languages is not as rich as it is for minicomputers. The most commonly used high-level
language for microprocessors is PL/M, developed by Intel Corporation for use with the 8008
and 8080 microprocessors. Derivatives of PL/M such as PL/M6800 by Motorola and National
Semiconductor's PL/M have recently been developed for the microprocessors produced by these
companies. There is interest in the development of compilers for other languages such as BASIC and FORTRAN,
but it remains to be seen whether this interest is sufficiently intense to promote actual commercial
development of these compilers. The small word size of the most common microprocessors has been an
impediment to their use in applications that require heavy calculating capabilities. As a result, there has
been less motivation for FORTRAN and BASIC compilers than might otherwise have been the case. With the
advent of 16-bit microcomputers, this situation is changing and will probably promote an increased level
of compiler development in the near future.

Microprocessor-based computer networks utilize a collection of microprocessors as their basic functional
unit to achieve higher computational power, reliability and flexibility. It was mentioned in previous
sections that such networks may replace not only the traditional I/O controllers, but also
minicomputers and even more powerful machines. Consequently, factors such as small memory size, limited I/O
capabilities and operating systems are no longer a stumbling block to the development of suitable high-level
languages.

A microprocessor-based network consisting of a number of possibly nonhomogeneous microprocessors provides
new challenges to language designers. Obviously some of the emphasis on language features is shifted, as
the system characteristics and application requirements are different from those of a single-processor
microcomputer. Microcomputer networks usually have a larger storage capacity, better
computational power and a more powerful I/O capability, along with a higher degree of interface complexity. Thus
they favor the use of higher-level languages than those now available for conventional microcomputers.

The most common means of man-machine interface in microprocessor networks is assembly language
programming. Because of the larger storage generally available in a network environment, it is possible for
the assembler to reside directly in the computer system. However, if the processing elements are
nonhomogeneous, there will be a need for several resident assemblers, one for each particular microprocessor,
causing a serious drain on the amount of storage available. Therefore, for a multi-microprocessor
network it is desirable to have a single meta assembler, capable of supporting a number of nonhomogeneous
microprocessors, thus saving storage without sacrificing the flexibility and reliability offered in a
network environment.

A meta assembler is a general purpose assembler which accepts the characteristics of a particular
microprocessor as input and then translates a machine independent, but microprocessor oriented, assembly
language into the equivalent machine code for the target microprocessor (Figure 9). The machine
characteristics may be input as a series of subroutines, tables, instructions of a suitable specification
language or a mixture of any of the three. There are few, if any, proposals for a suitable meta assembler
and the supporting assembly language. Davidson proposed a microprocessor programming language which was
similar to PDP-11 assembly code in format and contained most of the existing microcomputer instructions
as a subset. Instead of a meta assembler, however, he proposed development of an assembler for each
popular microprocessor, with translation to be done in four passes using a combination of macro-like
expansion and compilation techniques. This approach is similar to the proposal for UNCOL (UNiversal
Computer Oriented Language), which was put forward in the late 50's as a means of translator writing for a
large class of computers, although in this case the scope is much more limited.

It is felt, however, that Davidson's scheme of writing a separate assembler for each microprocessor is
restrictive. Many functions, such as bit packing and the resolving of forward references, are essentially
independent of a particular processor architecture and may be implemented uniformly for all the
microprocessors. It is therefore proposed that the construction of the Meta Assembler be broken into two
distinct parts:

a) Development of processor independent modules of the Meta Assembler in a 'skeletal' form. Such
modules  are  machine  independent  but  are  controlled  by  the  parametric  inputs  which  constitute  the 
machine  description. 

b)  Development  of  the  medium  through  which  the  processor  dependent  characteristics  can  be  described  to 
the  Meta  Assembler. 

Thus there are two phases in the construction of an assembler for a processing element of the microprocessor
network. In the first phase all the general purpose modules of the Meta Assembler which are necessary
for the translation of the Meta Assembly language into machine language are written. These modules may
be developed on a powerful computer in a suitable language such as FORTRAN or BASIC; the result is called the
Cross Meta Assembler. Once this phase is completed, it need not be repeated for any new microprocessor.
The second phase involves preparation of the characteristics of the individual microprocessors. The
medium in which this description is provided may vary and may be, among other things, the host language,
i.e. the language in which the first phase was written, e.g. FORTRAN or BASIC. It is felt, however, that
because the description is oriented toward the hardware characteristics of the
microprocessors, a conventional hardware description language such as ISP is more suitable for this purpose.
ISP allows the microprocessor's operational characteristics, such as the assembly
instruction repertoire, machine language instructions and their representative bit patterns, register
transfer paths, interrupt flags and error conditions, to be described in a precise and unambiguous manner and
yet be manipulated by an automatic processor. While the first phase in the design of the Meta Assembler is
carried out only once, the second phase must be repeated for each nonhomogeneous microprocessor in the
network. Figure 9 illustrates the different phases and the operational diagram of the Meta Assembler.

Fig. 9. Generation of an assembler using the Meta Assembler.
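
The two-phase construction can be illustrated with a very small table-driven assembler. The Python sketch
below is an assumption-laden illustration, not the Meta Assembler itself: `assemble` plays the role of the
processor independent skeleton (label handling, forward references, byte packing), and each machine
description is a plain table rather than an ISP text. The two targets, `MACHINE_A` and `MACHINE_B`, their
opcode values and the mnemonics NOP, JUMP and HALT are all invented for the example.

```python
def assemble(source, machine):
    """Processor-independent skeleton: two passes over the source.
    Pass 1 assigns addresses to labels (so forward references resolve);
    pass 2 packs opcode and operand bytes as directed by `machine`."""
    opcodes = machine["opcodes"]          # mnemonic -> (opcode, operand bytes)
    little_endian = machine["little_endian"]

    # Pass 1: build the symbol table.
    symbols, address, statements = {}, 0, []
    for line in source.splitlines():
        line = line.split(";")[0].strip()         # drop comments
        if not line:
            continue
        if line.endswith(":"):
            symbols[line[:-1]] = address
            continue
        mnemonic, *operand = line.split()
        _, size = opcodes[mnemonic]
        statements.append((mnemonic, operand[0] if operand else None))
        address += 1 + size

    # Pass 2: emit machine code.
    code = []
    for mnemonic, operand in statements:
        opcode, size = opcodes[mnemonic]
        code.append(opcode)
        if size:
            value = symbols[operand] if operand in symbols else int(operand, 0)
            data = [(value >> (8 * i)) & 0xFF for i in range(size)]
            code.extend(data if little_endian else reversed(data))
    return bytes(code)


# Second phase: one description per target. Encodings here are invented.
MACHINE_A = {"little_endian": True,
             "opcodes": {"NOP": (0x00, 0), "JUMP": (0xC3, 2), "HALT": (0x76, 0)}}
MACHINE_B = {"little_endian": False,
             "opcodes": {"NOP": (0x01, 0), "JUMP": (0x7E, 2), "HALT": (0x3E, 0)}}

PROGRAM = """
start:
    NOP
    JUMP done      ; forward reference
    NOP
done:
    HALT
"""

code_a = assemble(PROGRAM, MACHINE_A)
code_b = assemble(PROGRAM, MACHINE_B)
```

The same source text yields different object code for the two descriptions, while `assemble` itself is
written once. A realistic description would also have to cover addressing modes, variable instruction
formats and register names, which a flat opcode table cannot express.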

High-level language issues in microprocessor networks are similar to those of assembly language and
the development of assemblers for such systems. If the network consists of homogeneous microprocessors,
the task of high-level language compiler development is greatly simplified. Again, the enhancement in
computational power of the computer due to the harmonious activities of the individual microprocessor
elements, coupled with a wider variety of I/O devices and a larger storage capacity, brings about an
environment in which high-level language compilers can conveniently be developed. Such a compiler is,
however, totally machine dependent in order to achieve high utilization of the target machine's critical
resources. For example, PL/M is a very suitable high-level language for a network consisting of
individual 8080 microprocessors. Note also that the small word size of the microprocessors is not a stumbling block
to the development of compilers for such scientifically oriented languages as FORTRAN or BASIC, because
it is possible to logically integrate microprocessors to perform as a single, more powerful processing
element with a wider bandwidth, at an obvious sacrifice in speed. Such high-level languages may be
used for software development on microprocessor-based computers in situations where execution time is
not a crucial factor.

A network consisting of nonhomogeneous microprocessors poses a new challenge to high-level language
compiler developers. This challenge is in two directions: 1) development of a suitable language which
can be efficiently mapped onto any of the microprocessors and 2) development of compiling techniques which
can produce object code for a number of microprocessors in a formal way without causing a great deal of
reprogramming effort.


Development of a high-level language for multi-microprocessor networks is an open-ended problem. Most
present-day languages are not suitable candidates since they are either too machine independent or too machine
dependent. Any proposed high-level language should consist of two parts. The first part is a kernel
which is common among all microprocessors. This core of the language is machine independent and its
features reflect the requirements of the application environment and suit the algorithms which are to be
written. The second part, however, is machine dependent and varies from microprocessor to microprocessor.
While the machine dependent features provide a means of controlling the network's resources and their
efficient utilization, such features may be put aside in many cases where the network is to be viewed as a
general purpose microprocessor-based computer.

Generation of efficient code for a number of nonhomogeneous microprocessors is not an easy task. In
practice new code generation modules must be written for each microprocessor and used when
needed. This is a tedious task and incurs a major programming cost when the number of nonhomogeneous
microprocessors is large. In very recent years there has been some effort to alleviate this problem.
In particular, there have been attempts to apply the techniques already developed for transporting
compilers between various machines in order to develop automatic code generating systems for
microprocessors. Bunza proposes a scheme involving a system to generate machine code for many different
microprocessors, utilizing a single compiler, a single code generator and processor-specific data bases.
Adaptation of the code generator to new processor architectures is claimed to be limited to the creation
of a single processor description data base. The code generating system translates or interprets BCPL
OCODE with a hierarchical operator definition tree, manipulates data classes, and generates machine
code utilizing a processor description language which is a derivative of ISP. This and other techniques
require a thorough knowledge of the processor description language and the intermediate medium (BCPL in
this case) and are subject to manual tuning after the processor description is complete. Their
usefulness is yet to be seen in a real commercial environment.


SOFTWARE  TOOLS  FOR  SYSTEM  DEVELOPMENT 


Another area of major interest in microprocessor-based computer systems is the software aids which can be
utilized for the design, development and evaluation of new systems. There have already been significant
innovations in this direction, especially for single-microprocessor computer systems. Simulators have
long been available for the design, checkout and evaluation of microprocessors. Simulators are software
packages which execute the operations specified by an input medium (e.g. program process signals, etc.)
and simulate the response of the target system. Simulators are convenient tools for verifying a design
before actual construction. Most manufacturers of microprocessors provide 'canned' simulators for their
microprocessors, usually written in FORTRAN and run on a time-sharing system. In very recent years there
have been some attempts to move away from the concept of 'canned' simulator packages and to develop tools
which can be used to 'create' new simulators for various microcomputers.

SIMIC (stands for Simulator for Microprocessors) is a microcomputer software development and evaluation
package which has been designed and implemented on a Digital Equipment Corporation PDP-11 minicomputer.
SIMIC consists of a main module containing machine independent routines and data and a submodule which
contains the characteristics of a particular microprocessor (Figure 10). Generation of a new simulator requires
the preparation of microprocessor dependent information in the tables and subroutines which make up the
submodule part of SIMIC. Programming of a submodule is oriented to those structures in which
microprocessors differ. As part of SIMIC, a general purpose set of programs, called the MICRO module, has
been developed to facilitate the implementation of microcomputer cross-assemblers. A cross-assembler for
a specific microprocessor is formed by linking a submodule to the main MICRO module. This submodule
contains tables and a specialized subroutine which tailor the main module to a specific microprocessor.
An assembler implemented with MICRO can be used to generate microcomputer software for control tasks.

Fig. 10. SIMIC modules.

The  SIMIC  package  provides  the  bookkeeping  mechanism  for  evaluating  microprocessor  software  in  terms  of 
design  effort  and  speed  of  operation  and  hardware  in  terms  of  the  computer  and  external  hardware  costs. 
The  evaluation  process  focuses  on  the  tradeoffs  between  the  design  and  computer  costs. 
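
The split between a machine independent main module and a processor specific submodule can be sketched as
follows. This Python fragment is a minimal illustration under stated assumptions and does not reproduce
SIMIC: `make_simulator` stands for the main module (fetch/execute loop plus cycle bookkeeping), and the
`TOY8` table stands for a submodule. The instruction names, cycle counts and the single-accumulator machine
model are all hypothetical.

```python
def make_simulator(submodule):
    """Main module: a machine-independent fetch/execute loop.
    `submodule` supplies everything in which processors differ."""
    table = submodule["instructions"]     # opcode -> (handler, cycle count)
    mask = (1 << submodule["word_bits"]) - 1

    def run(program, max_steps=1000):
        state = {"pc": 0, "acc": 0, "halted": False, "cycles": 0}
        for _ in range(max_steps):
            if state["halted"] or state["pc"] >= len(program):
                break
            opcode, operand = program[state["pc"]]
            handler, cycles = table[opcode]
            state["pc"] += 1
            handler(state, operand)
            state["acc"] &= mask          # wrap to the target word size
            state["cycles"] += cycles     # bookkeeping for speed evaluation
        return state

    return run


def _load(state, operand):
    state["acc"] = operand


def _add(state, operand):
    state["acc"] += operand


def _halt(state, operand):
    state["halted"] = True


# Hypothetical 8-bit target; opcodes and cycle counts are invented.
TOY8 = {"word_bits": 8,
        "instructions": {"LDI": (_load, 2), "ADI": (_add, 2), "HLT": (_halt, 1)}}

simulate_toy8 = make_simulator(TOY8)
result = simulate_toy8([("LDI", 250), ("ADI", 10), ("HLT", None)])
```

Creating a simulator for another processor means writing another table and its handlers; the loop and the
cycle accounting are reused unchanged, which is the property the text attributes to simulator generators.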

Simulator generators such as SIMIC are very desirable for microprocessor-based computer systems. In a
reconfigurable network environment where the application requirements demand changes in system
characteristics, 'canned' simulators cannot respond to such demands in a sufficiently short time to justify
their usefulness. Simulation of a multi-microprocessor network, in general, consists of the following
steps:

(a)  simulation  of  individual  processing  elements 

(b) simulation of the physical configuration, i.e. processor-memory module interconnections, bus
structure, I/O, etc.

(c) simulation of the control, i.e. interprocessor communication, task distribution and allocation,
priority assignment, etc.

Simulation is an important step in the design and implementation of computer networks. It not only
specifies the characteristics that individual processing elements should have to meet the application
requirements (e.g., speed, proper instruction set, proper bandwidth, etc.) but also suggests the best
network configuration, bus structure and any other parameters which influence the design constraints. In
many cases, simulation runs of a network detect hazardous run-time conditions, such as
deadlock, race conditions and integrity problems, prior to the construction of the network.
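
As one example of a hazard a simulation run could report, deadlock can be detected by looking for a cycle
among processors that wait on each other's resources. The Python sketch below is a simplified
illustration, assuming each processor waits for at most one resource and each resource has at most one
holder; the function name `find_deadlock` and the resource and processor names are hypothetical.

```python
def find_deadlock(holds, wants):
    """Return a list of processors forming a wait-for cycle, or None.
    `holds` maps each resource to the processor holding it;
    `wants` maps each processor to the resource it is waiting for."""
    # Edge P -> Q means "P waits for a resource that Q holds".
    waits_for = {p: holds.get(r) for p, r in wants.items()}

    for start in waits_for:
        path, current = [], start
        while current is not None and current not in path:
            path.append(current)
            current = waits_for.get(current)
        if current is not None:               # revisited a processor: a cycle
            return path[path.index(current):]
    return None


# P1 holds the bus and wants memory; P2 holds memory and wants the bus.
holds = {"bus": "P1", "memory": "P2"}
deadlocked = find_deadlock(holds, {"P1": "memory", "P2": "bus"})
safe = find_deadlock(holds, {"P1": "memory"})
```

A simulator would evaluate such a check on the simulated state after each step; race conditions and
integrity problems need different checks and are not covered by this sketch.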

If a network simulator is to be used as a development tool to answer some of the problems above, it should
not be canned, i.e., it should not simulate a specific network of fixed characteristics and configuration.
What is in fact needed is a software system development tool, to be called INTERactive SImulator Generator
for microprocessor networks, or INTERSIG. INTERSIG can be viewed as a general purpose network simulation
package, available on a time-sharing system, which provides a simulation environment for the design,
checkout and maintenance of the microprocessor-based computer network. Similar to SIMIC, INTERSIG
consists of general machine independent modules which interface with the outside world through tables
automatically created by a processor from the network description. It is felt, however, that the network
characteristics should be described via a suitable specification language such as an augmented ISP. This
allows for interactive development of the network description and early checkout of the design using
standard software tools.


Through the specification language instructions, the user specifies the rigid characteristics of the
individual microprocessors, e.g. instruction set and the equivalent machine code, word size, addressing modes,
registers, inter- and intra-submodule communication paths, timing requirements, etc. Such a description
should in fact be sufficient to simulate the complete operation of the individual processing elements.
Instructions should also be provided to describe the network configuration, including the interconnections
between the processing elements and other subsystems, the bus structure and other characteristics which are
crucial in the definition of a network. The complete specification 'program' is input to
INTERSIG, which processes it just like any other translator, fills its internal data structures with the
information and subsequently tailors itself to the specification, producing a new simulator for the
microprocessor network.


INTERSIG resides on a powerful time-sharing system to utilize the run-time support normally available on
such systems. General purpose simulation packages such as SIMIC or INTERSIG become more important as
microprocessors become more and more common as building blocks for the design and construction of powerful
application-oriented or general purpose computers. Along with them, a full complement of system software
production aids, such as cross-meta compilers and cross-meta assemblers, is needed so that appropriate
software processors can be developed and checked out on the simulators even before the computer system is
constructed. In this way, a very realistic measure of all system parameters and their cost-effectiveness
can be obtained very early in the design phases, before a major budget is committed to the project.

CONCLUSIONS


The low-cost microprocessor revolution shows a new trend in computer architecture. As
microprocessors become the building blocks for computer systems, the gap between the component designer
and the system designer is becoming narrower. However, before a microprocessor network as a general
purpose system becomes a reality, there are several other issues that need to be investigated. First, the
architecture of future microprocessors must be investigated. Present-day microprocessors are limited in
capabilities and power. To overcome these limitations, future microprocessors should have
features that provide flexibility and performance improvements. These features include powerful
communication capabilities, including protocols such as SDLC, higher data processing bandwidth, language support
features, op-code flexibility and a larger addressing range. Secondly, tools to describe, simulate,
emulate and design a multi-microcomputer system need to be developed. Tools such as SIMIC or INTERSIG, as
general purpose simulation packages, provide powerful capabilities in this direction. Along with them,
a complete set of software production aids such as cross-meta compilers is needed. Finally, very little
work has been done on the testing side; to circumvent the difficulty of testing, tools integrated with
system development are needed. All these issues should be studied thoroughly before a major commitment
is made to the design of a microprocessor-based system.
ABSTRACT: We describe a general-purpose, asynchronous, lexical and syntactic
processor for use in the compilation of computer programs. It is table-driven,
with the capability of translating different languages by loading the processor
with different tables. Automatic methods (i.e., computer algorithms) exist to
generate these tables, and these algorithms are practical both in time and space.
The processor is designed to use bipolar bit-sliced microprocessors, and consists
of two independent units: a lexical analyzer (scanner) and a syntactic analyzer
(parser), which communicate via a shared latch. We argue that the time and space
benefits of such a processor make it an attractive addition to any computer.

SCANNING AND PARSING PROCESSOR


INTRODUCTION 

This paper describes the front end of a compiler (lexical scan and
syntactic parse) implemented as a microprogrammed processor to be placed
between memory and the central processor of a computer. The generalized
scanning and parsing algorithms are microprogrammed and are table-driven
by data in local RAMS that may be reloaded for each language to be
translated. The input to the pre-processor is the character stream of the
source program code. Output is in the form of two data streams: 1) the
production numbers realized in a left-to-right bottom-up parse (used to
drive the compiler) and 2) the character strings that are the tokens being
recognized (used in subsequent compilation steps). The pre-processor is
organized into two distinct processors that execute in parallel with their
own local memories and communicate via a dedicated interrupt system. The
scanner delimits input characters into token character strings and delivers a
token code number to the parser via a shared buffer (TOKEN REGISTER). The
parser consumes the tokens and identifies context-free grammar reductions
which are sent to a compiler executing in the main processor. This
organization is shown in figure 1.

The compiler writer needs only to specify the syntax of the language in
BNF and the tokens as regular expressions. We have written computer
algorithms to convert these specifications to tables, which are loaded into the
RAMS of the scanner and parser. The compiler writer is then able to
concentrate on the semantics of the language, secure in the knowledge that the
scanning and parsing will be handled in exactly the manner specified. This
is a distinct advantage over using ad hoc scanning and parsing methods.



The scanner and parser operate asynchronously in parallel. One
shared buffer is placed between the units, into which the scanner deposits
token codes or error messages later consumed by the parser. The latch
buffer contains an interrupt output line that is set high by the parser when
it clears the buffer and is pulled low again when the scanner writes a new
token. Both units latch the interrupt line in their micro-sequencers and
must test it prior to any buffer usage. The character string output from the
scanner to the main processor follows the same protocol.

The  scanner  delivers  two  special  codes  to  the  parser.  Zero  indicates  a 
scanning  error  (this  is  determined  from  the  tables)  and  255  marks  the  end  of 
transmission  (meaning  that  the  source  code  end  marker  has  been  read).  At 
the  end  of  transmission,  the  scanner  and  parser  go  into  an  idle  loop  and 
require  a pulse  on  the  START  line  to  be  reactivated. 
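
The handshake on the shared buffer can be modelled in software. The Python sketch below is an
illustrative model only, assuming that a high interrupt line means the buffer is empty and a low line means
a token is waiting; it uses threads and a condition variable in place of the microprogrammed units, which
test the line themselves. The class and function names are hypothetical, while the codes 0 and 255
follow the text above.

```python
import threading

SCAN_ERROR = 0      # special code: scanning error
END_OF_INPUT = 255  # special code: end of transmission


class TokenRegister:
    """One-slot latch. `line_high` is True while the slot is empty."""

    def __init__(self):
        self._cond = threading.Condition()
        self.line_high = True
        self._code = None

    def write(self, code):                 # scanner side
        with self._cond:
            self._cond.wait_for(lambda: self.line_high)   # test line first
            self._code = code
            self.line_high = False         # pulled low: a token is waiting
            self._cond.notify_all()

    def read(self):                        # parser side
        with self._cond:
            self._cond.wait_for(lambda: not self.line_high)
            code = self._code
            self.line_high = True          # set high: buffer cleared
            self._cond.notify_all()
            return code


def scanner(register, token_codes):
    for code in token_codes:
        register.write(code)
    register.write(END_OF_INPUT)


def parser(register, received):
    while True:
        code = register.read()
        if code == END_OF_INPUT:
            break
        received.append("error" if code == SCAN_ERROR else code)


register, received = TokenRegister(), []
threads = [threading.Thread(target=scanner, args=(register, [3, 7, 0, 12])),
           threading.Thread(target=parser, args=(register, received))]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because the latch holds a single code, the scanner can run at most one token ahead of the parser; neither
side touches the buffer without first checking the state of the line.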

The processor was designed using Intel's 3000 series bipolar,
bit-sliced microprocessors. They were chosen for their high speed and large
number of processor inputs and outputs. Each part of the processor consists
of an 8-bit central processing element (CPE) constructed from 2-bit slices,
a look-ahead carry generator, a microinstruction sequencer, 512 x 8-bit
ROMS, and 4K bit RAMS. The processors share an Intel 3212 latch for
internal communication, and use more 3212's for buffering output to the main
processor. The 3212 latches are well suited for this purpose, since they can
be polled by either producer or consumer for empty/full conditions. The
scanner accepts the program stream through a FIFO buffer.

THE  SCANNER 

Lexical  analysis  in  compiler  or  translator  terminology  refers  to  the 
partitioning  of  an  input  string  into  non-overlapping  character  strings  that 
represent the lexicons or tokens of the source language. In terms of formal
language theory (AD, 1972), tokens are the terminal symbols of a formal
syntax language. In syntactic analysis, there is no distinction between
two separate instances of a given terminal symbol despite possible character
string differences. (E.g., all possible integer constants may be represented
by the terminal symbol INT_CONST.)

Due to these differences, a terminal symbol is actually a set, or
language, of character strings. Thus, programming languages are normally
specified in terms of two levels of languages, the token languages and the
syntax language. The separation exists for two main purposes. The first
is to provide a simpler vocabulary for the syntax. The second and more
important reason is to allow two distinct recognition methods for the two
levels, each tailored to best handle (via appropriate space and time
optimizations) its input language.

Tokens,  which  generally  include  identifiers,  literals,  and  delimiters, 
are  assumed  to  be  regular  languages  (GRE,  1971).  A recognizer  designed  just 
for  regular  languages  is  the  finite  automaton  (HU,  1969).  A finite  automaton 
is  a machine  that  moves  through  a sequence  of  states  (finite  in  number)  in 
a state  graph. 

Transitions  in  the  graph  are  directed  arcs  labelled  with  a set  of 
characters,  the  recognition  of  which  allows  the  transition  to  be  performed. 

If  the  automaton  is  deterministic,  each  character  in  the  input  alphabet  is 
allowed  to  select  only  a single  transition.  Certain  states  are  marked  as  final 
states that signal recognition of a member of the regular set.

A deterministic finite automaton state graph example follows:

[Figure: state graph of a deterministic finite automaton]

When a deterministic finite automaton is used for scanning a source
program,  the  final  states  of  the  machine  are  labelled  with  the  recognized 
token  set.  When  the  automaton  enters  a final  state  and  the  next  input 
character  does  not  label  a transition  out  of  the  state,  a code  for  the  token 
is emitted and the automaton is restarted in its initial state. The automaton
acts as a transducer in this respect. If there is any other point at
which  the  next  input  character  does  not  select  a transition,  an  error  code 
can  be  emitted  instead.  The  string  that  makes  up  the  token  is  often  required 
for  later  analysis.  The  automaton  can  have  its  transitions  labelled  with 
flags  denoting  whether  or  not  the  character  causing  the  transition  should  be 
concatenated  to  the  string  being  built  as  the  current  token. 

To implement a finite automaton model, a simple algorithm exists which
is driven by a two-dimensional table. The rows of the table represent the
states of the machine and the columns correspond to the sets of characters
that are distinguished by the lexical analysis unit or scanner. Each table
entry can be characterized by the following PASCAL record which denotes the
possible next actions to be taken by the scanner:

   TableEntry : RECORD
      CharacterDisposition : (Emit, Forget);
      CASE FinalState : BOOLEAN OF
         TRUE:  (TokenCode : INTEGER;
                 LookAhead : BOOLEAN);
         FALSE: (NextState : INTEGER)
   END;

In the TableEntry, CharacterDisposition is used to decide whether the transition
character is to be added to the token string or deleted. When a state is
a final  state,  the  TokenCode  is  emitted  separately  from  the  character  and  the 
transition  character  must  be  scanned  again  if  LookAhead  is  true.  Otherwise 
another  transition  is  to  be  followed  and  the  new  state  entered  is  NextState. 

Such  a table-driven  scanner  can  be  efficiently  implemented  in  flexible 
firmware.  The  scanning  tables  for  a particular  set  of  token  languages  (for 
a particular  programming  language)  are  loaded  into  a series  of  4096-bit 
RAM's  of  64  rows  and  64  columns.  In  this  design,  ten  such  RAM's  are  used  and 
eight  bits  of  each  table  entry  can  be  used  for  the  NextState  or  TokenCode. 

Two  bits  suffice  for  encoding  the  other  fields.  One  input  port  is  necessary 
to  accept  the  source  language  character  string.  One  output  port  is  used  to 
emit  token  codes  to  the  parsing  unit;  another  to  emit  the  characters  to  be 
concatenated  into  token  strings. 
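
The ten-bit entry layout described above can be illustrated with a short sketch. The exact bit assignment is not given in the text, so the layout below is an assumption: bit 9 holds FinalState, bit 8 holds LookAhead for a final entry and the Emit disposition otherwise, and the low eight bits hold TokenCode or NextState. The helper names (`pack`, `unpack`, `address`) are hypothetical.

```python
# Assumed layout of one 10-bit table entry (not specified in the paper):
#   bit 9      FinalState
#   bit 8      LookAhead (final entry) or Emit (non-final entry)
#   bits 7..0  TokenCode (final entry) or NextState (non-final entry)
FINAL_BIT = 1 << 9
FLAG_BIT = 1 << 8
VALUE_MASK = 0xFF

ROWS = 64      # states, one per RAM row
COLUMNS = 64   # character sets, one per RAM column


def pack(final, flag, value):
    """Pack the three fields into a 10-bit word."""
    if not 0 <= value <= VALUE_MASK:
        raise ValueError("value must fit in eight bits")
    return (FINAL_BIT if final else 0) | (FLAG_BIT if flag else 0) | value


def unpack(word):
    """Return (final, flag, value) for a 10-bit word."""
    return bool(word & FINAL_BIT), bool(word & FLAG_BIT), word & VALUE_MASK


def address(state, char_set):
    """Word address of an entry in a 64 x 64 table stored row by row."""
    if not (0 <= state < ROWS and 0 <= char_set < COLUMNS):
        raise ValueError("state or character set out of range")
    return state * COLUMNS + char_set
```

With this layout, ten 4096-bit RAMs supply one bit plane each, so every (state, character set) pair addresses a single ten-bit word.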

There is a special case in token recognition that must be considered
for time and space efficiency in a firmware implementation of a scanner.
There often occur several dedicated character strings that are formally members
of a particular token language but used as distinct tokens. The most
common example is the use of selected "identifiers" to denote delimiters for
which special symbols do not exist. BEGIN, INTEGER, and END are three
examples  from  PASCAL  that  fit  the  description  of  "identifier"  except  for 
their  being  reserved  as  symbols.  Such  exceptions  can  be  grouped  under  a 
classification  of  reserved  words.  Reserved  word  recognition  can  be  built 
into  the  table  structure  described  but  tends  to  make  the  table  sparse  and 
very large. As an alternative, a pattern matching algorithm can be formulated
and the reserved strings can be stored in a tightly packed representation.
In the implementation, several rows of the RAM are dedicated to
the  reserved  word  encodings. 

The  scan  unit  is  organized  as  shown  in  figure  3.  The  ROM  contains 
the  microprogrammed  version  of  the  following  general  algorithm  in  which 
input  and  output  and  a procedure  call  are  replaced  by  comments. 

START: STATE := 0;
LOOP:  (* fetch input character C *);
       CN := CHARACTERSETNUM[C];
       ENTRY := TABLE[STATE, CN];
       if ENTRY.FinalState then begin
          (* send ENTRY.TokenCode to parser *);
          if not ENTRY.LookAhead
             then (* send character C *)
             else (* reset scan pointer to reuse C *);
          goto START
       end else begin
          if ENTRY.CharacterDisposition = Emit
             then (* send character C *);
          STATE := ENTRY.NextState;
          if STATE > TABLEND
             then (* start reserved word search *)
             else goto LOOP
       end;
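
The general algorithm can be exercised in software with a small table. The following is a minimal sketch, not the microprogram itself: the three-state table, the character sets, and the token codes IDENT and INT_CONST are hypothetical, chosen only to show how the FinalState, LookAhead, and CharacterDisposition fields drive the loop. Code 0 is used for a scanning error, as in the text. The reserved word search is omitted, and the token code and character string are returned together as pairs rather than sent to separate ports.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    final: bool
    emit: bool = False        # CharacterDisposition = Emit (non-final entries)
    next_state: int = 0
    token_code: int = 0
    look_ahead: bool = False


# Hypothetical character sets (table columns) and token codes.
LETTER, DIGIT, BLANK, OTHER = range(4)
SCAN_ERROR, IDENT, INT_CONST = 0, 1, 2


def char_class(c):
    """Plays the role of CHARACTERSETNUM[C]."""
    if c.isalpha():
        return LETTER
    if c.isdigit():
        return DIGIT
    if c.isspace():
        return BLANK
    return OTHER


def go(state, emit=True):
    return Entry(final=False, emit=emit, next_state=state)


def done(code, look_ahead=True):
    return Entry(final=True, token_code=code, look_ahead=look_ahead)


# Rows are states: 0 = start, 1 = in identifier, 2 = in integer.
TABLE = [
    # LETTER           DIGIT   BLANK              OTHER
    [go(1),            go(2),  go(0, emit=False), done(SCAN_ERROR, look_ahead=False)],
    [go(1),            go(1),  done(IDENT),       done(IDENT)],
    [done(INT_CONST),  go(2),  done(INT_CONST),   done(INT_CONST)],
]


def scan(text):
    """Return (token code, token string) pairs for the input text."""
    tokens, chars = [], []
    state, i = 0, 0
    text += " "                      # trailing blank flushes the last token
    while i < len(text):
        c = text[i]
        entry = TABLE[state][char_class(c)]
        if entry.final:
            if not entry.look_ahead:
                chars.append(c)      # C belongs to the token: consume it
                i += 1
            # otherwise C is left in place and scanned again from state 0
            tokens.append((entry.token_code, "".join(chars)))
            chars.clear()
            state = 0
        else:
            if entry.emit:
                chars.append(c)
            state = entry.next_state
            i += 1
    return tokens
```

For example, `scan("x1 42")` yields an identifier followed by an integer constant; the blank that ends `x1` is a look-ahead character, so it is rescanned from the start state and then forgotten.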

In  the  implementation,  the  high  order  bits  of  each  data  word  (either  table 
entry,  character  set  code,  or  reserved  word  character  or  mark)  are  directly 
routed  into  the  3001  micro-sequencer.  The  other  data  bits  enter  one  of  the 
four two-bit-sliced processor registers. They are either compared to the
input character (in the reserved word match), buffered to be sent to the
parser (as a token code), used as a row selector in the RAM (as a next state),
or used as a column selector in the RAM (when the data is a character or character
set code). The character input from a FIFO buffer is saved in a register
of the processor, used as a RAM column selector to find the character set,
and perhaps sent to the character output stream.

Whenever  a character,  token  code,  or  error  code  is  sent  to  an  external 
device  or  the  asynchronous  parsing  unit,  the  scanner  must  poll  the  interrupt 
line associated with the shared latch to see if it is currently holding unprocessed
data. For this reason, the interrupt lines feed directly into the
3001 micro-sequencer as well. When an input stream character is finally consumed,
a new input character request is sent. The scanner loops while it
awaits  an  initial  START  pulse  or  the  latch-ready  signal  when  output  is 
buffered. 

The microinstructions contain five fields: function, carry-in, mask,
latch select, jump. The function is the instruction code of the 3002
processor slices. Carry-in sets the least significant carry for the 3003
carry  lookahead  generator.  The  mask  is  used  in  various  ways  to  enable 
selective  views  of  the  data  bits  and  give  a wider  variety  of  processor 
functions.  This  aspect  of  the  Intel  processor  proved  very  fruitful.  The 
latch  select  lines  are  used  to  control  the  input  and  output  latches  as 
well  as  the  memory  address  latches,  since  busses  are  used  to  overlap  line 
usage. Finally, the jump field is the 3001 sequencer next address instruction
code. Four 512-word 8-bit ROM units are used to contain the instructions.
Approximately 25% of the ROM words are used as instructions. The
compactness allows room for enhancement in the capability of the scanner
if needed. The utilization of the RAM depends on the language to be
translated.

THE PARSER


Parsing is the determination of the structure of an element of a language.
For computer languages, the structure is specified in terms of a context-free
grammar, such as by using BNF. In the following discussion, LHS(i) denotes
the left-hand side of production i, and RHS(i) denotes the right-hand side of
production i. The parser recognizes when a production in the BNF is applicable,
and returns the number of the production to the compiler.

The  algorithm  used  in  this  parsing  processor  is  a modification  of  LR 
parsing  first  described  by  Knuth  (KNU,  1965).  LR  is  a powerful  parsing 
method, but building the tables is expensive in both time and space. Therefore,
the tables most commonly generated are not LR(1) tables, but SLR(1)
(DER, 1971) or LALR(1) (AJ, 1974) tables instead. These methods are based
on LR(0) parsing methods, with less precise look-ahead sets than LR(1) uses.
Tables generated by LR(0), SLR(1), LALR(1) or LR(1) can each be used by the
parsing  processor;  the  parsing  algorithm  is  the  same  for  all  of  them. 

The  parsing  tables  describe  actions  for  a push-down  automaton  whose  stack 
is  used  to  keep  track  of  the  states  visited.  When  an  input  token  is  received, 
the  action  for  the  (state, token)  pair  is  looked  up  and  executed  (the  current 
state  is  kept  on  the  top  of  the  stack).  There  are  four  possible  actions: 

1)  Transition       Push the new state specified by the action, and
                     get another token.

2)  R reduction i    Pop SIZE(RHS(i))-1 states from the stack and use
                     LHS(i) as the next input token. OUTPUT i.

3)  L reduction j    Pop SIZE(RHS(j)) states from the stack and use
                     LHS(j) as the next input token. Return the current
                     input token to the input stream (it was used
                     as look-ahead). OUTPUT j.

4)  Error            Signal a syntax error and call compiler to invoke
                     error recovery or correction.

where "OUTPUT i" means "reduction i has been recognized", and i is sent to the
compiler.

The  parsing  algorithm  is: 

Push  State  1 on  the  stack. 

Get  the  first  input  token. 

DO  ACTION  (state  on  top  of  stack,  input  token) 

UNTIL  stack  is  empty. 

We  have  written  an  ALGOL  60  program  which  produces  tables  of  these  actions, 
given  a grammar  in  BNF  as  input.  An  example  of  the  parsing  algorithm  is  given 
at  the  end  of  the  paper. 
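
The four actions and the driving loop can be sketched in software as follows. This is an illustration under stated assumptions, not the authors' table generator or microcode: the grammar, the state numbering, and the ACTION table are hypothetical and were built by hand. How the stack becomes empty is not spelled out in this section, so the sketch adds an assumed ACCEPT action that pops the start state once the goal symbol G is produced. Error code 253 is the syntax error code described later in the text.

```python
from collections import deque

# Hypothetical grammar; production numbers are what the parser outputs.
#   1: E -> T + E     2: E -> T     3: T -> n     4: G -> E $
RHS_LEN = {1: 3, 2: 1, 3: 1, 4: 2}
LHS = {1: "E", 2: "E", 3: "T", 4: "G"}

# ACTION[(state, token)] = (kind, argument). Kinds: "T" transition,
# "R" R reduction, "L" L reduction, "ACCEPT" (an assumption of this sketch).
# Any missing pair is an Error action.
ACTION = {
    (1, "n"): ("R", 3), (1, "T"): ("T", 2), (1, "E"): ("T", 4),
    (1, "G"): ("ACCEPT", 0),
    (2, "+"): ("T", 3), (2, "$"): ("L", 2),
    (3, "n"): ("R", 3), (3, "T"): ("T", 2), (3, "E"): ("R", 1),
    (4, "$"): ("R", 4),
}
SYNTAX_ERROR = 253


def parse(tokens):
    """Return the stream of reduction numbers for a token sequence."""
    pending = deque(tokens)
    stack = [1]                          # push state 1
    output = []
    token = pending.popleft()            # get the first input token
    while stack:                         # until stack is empty
        kind, arg = ACTION.get((stack[-1], token), ("ERROR", 0))
        if kind == "T":
            stack.append(arg)
            if not pending:              # ran out of input: treat as error
                output.append(SYNTAX_ERROR)
                break
            token = pending.popleft()
        elif kind == "R":
            # Pop SIZE(RHS)-1 states; the last RHS symbol was never pushed.
            del stack[len(stack) - (RHS_LEN[arg] - 1):]
            output.append(arg)
            token = LHS[arg]
        elif kind == "L":
            # Pop SIZE(RHS) states and return the look-ahead token.
            del stack[len(stack) - RHS_LEN[arg]:]
            pending.appendleft(token)
            output.append(arg)
            token = LHS[arg]
        elif kind == "ACCEPT":
            stack.clear()
        else:
            output.append(SYNTAX_ERROR)
            break
    return output
```

In this table the R reductions are those decided on reading the last right-hand-side symbol, while the single L reduction (production 2 on `$`) needs the look-ahead token and therefore returns it to the input.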

Parser  Implementation 

The  processor  is  8 bits  wide  (four  2-bit  slices).  This  width  was  chosen 
since  that  was  judged  large  enough  to  contain  the  information  stored  in  the 
RAM.  The  RAM  is  10  bits  wide,  but  the  2-bit  action  code  is  not  returned  to  the 
CPE. It is directly connected to a latch in the micro-sequencer, thus speeding
up processing. The only quantities that are larger than 8 bits are the
addresses  for  the  RAM;  they  are  calculated  byte-serially  and  held  by  a latch 
until  all  13  bits  are  present  (a  look-ahead  carry  generator  is  used  to  increase 
addition  speeds).  The  CPE  has  two  outputs,  one  used  for  sending  addresses  to 
the  RAM,  and  the  other  connected  to  the  parser  output  latch.  The  3002  CPE  also 
has  2 sets  of  inputs,  one  for  the  RAM  data,  and  one  for  input  from  the  scanner. 
The  latter  is  latched  and  can  be  interrogated  to  determine  whether  data  is 
present. Internal registers of the CPE are used to save frequently accessed
quantities: the top of stack address, a pointer to the (LHS, RHS length) pairs,
a pointer to the current state, and the current token. An overall view of the
parser architecture is shown in figure 4.

The instruction set of the 3002 CPE is powerful enough for this application,
and contains features which made space-efficient programming easy. We
used a 24-bit instruction for the parsing processor. The CPE's ability to
mask data on the input bus proved to be very useful. This mask was present
in each instruction, and could be used to introduce constants into the CPE.
The instructions also contain the jump field, function, RAM read/write enable
and latch select bits. The jump field of the 3001 micro-sequencer was at first
thought  to  be  a major  drawback,  but  the  resulting  code  turned  out  to  be  quite 
dense.  The  code  for  the  parser  occupies  only  half  of  each  8 x 512-bit  ROM. 

Using  the  Parser 

To  use  the  parser,  one  first  specifies  the  grammar  in  BNF  and  modifies 
it, if necessary, until it is in LALR(1) form (typically a rather simple
process).  The  tables  are  then  encoded  to  distinguish  the  actions  (upper  2 
bits  distinguish  the  type  of  action,  lower  8 bits  give  the  reduction  or  state), 
and  converted  into  an  array  of  lists.  The  LHS  and  RHS  lengths  are  determined, 
and  the  tables  are  loaded  into  the  RAM.  The  parser  is  begun  with  a START 
pulse,  awaits  tokens  from  the  scanner,  and  begins  parsing.  The  parser  outputs 
a stream  of  reduction  numbers,  and  issues  a code  of  0 to  signal  the  end  of  the 
parse.  There  are  four  error  codes  which  are  returned  as  pseudo-reductions  with 
very  high  numbers. 

ERROR CODE    MEANING

   255        Parse is over (e.g., an entire correct program has been
              recognized), but the scanner is still sending more tokens.
              The parser continues to output 255 until an end of input
              is received.

   254        Scanner error. Enter RESTART mode.

   253        Syntax error. Enter RESTART mode.

   252        Stack overflow in the RAM. This error is fatal, and
              only total restart is possible.

The  user  may  wish  to  modify  the  stack  before  restarting  the  parse  for  error 
recovery.  For  this  reason  the  address  of  the  top  of  stack  is  saved  in  the 
RAM  (pointed  to  by  locations  0 and  1),  and  is  reread  from  there  by  the  parsing 
unit  on  a RESTART. 

The  RAM  organization  is  described  in  figure  2.  As  mentioned  earlier, 
locations  0 and  1 contain  the  address  of  the  bottom  of  stack,  which  contains 
the  current  top  of  stack  address.  Next  are  the  pointers  to  the  beginning  of 
the (token, action) list for each state, followed by the (LHS, RHS length) array.
The (token, action) pairs follow, with each list terminated by a zero action.
The top of stack address and the stack fill the rest of the RAM. There are
20 4K-bit memory planes in the RAM, providing sufficient space for typical
production compilers. For example, the PASCAL compiler under development at
the University of Wisconsin (FL, 1977) uses a software LALR(1) parser. There
are  222  productions,  204  states,  and  the  matrix  has  1900  non-zero  entries. 

This  means  that  approximately  5000  bytes  of  RAM  would  be  used  for  tables  in 
our  design,  leaving  over  3K  bytes  for  the  stack.  Since  the  stack  depth 
typically  never  exceeds  30,  this  allows  sufficient  space  for  "practical" 
compilers. 
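
The space estimate can be checked with rough arithmetic. The sketch below is only an accounting under assumptions: the text does not state how many RAM words each pointer, pair, or list terminator occupies, so two 10-bit words are assumed for each. Twenty 4K-bit planes at ten bits per word give 8192 words, which agrees with the 13-bit addresses mentioned earlier.

```python
# Figures taken from the text.
PLANES = 20
BITS_PER_PLANE = 4096
WORD_BITS = 10
PRODUCTIONS = 222
STATES = 204
NONZERO_ENTRIES = 1900

# Assumptions of this sketch (not stated in the paper): every pointer,
# (LHS, RHS length) pair, (token, action) pair, and list terminator
# occupies two RAM words.
WORDS_PER_ITEM = 2

ram_words = PLANES * BITS_PER_PLANE // WORD_BITS

table_words = (
    2                                   # locations 0 and 1
    + WORDS_PER_ITEM * STATES           # pointers to the per-state lists
    + WORDS_PER_ITEM * PRODUCTIONS      # (LHS, RHS length) array
    + WORDS_PER_ITEM * NONZERO_ENTRIES  # (token, action) pairs
    + WORDS_PER_ITEM * STATES           # zero action ending each list
)

stack_words = ram_words - table_words

print(ram_words, table_words, stack_words)
```

Under these assumptions the tables take about 5000 words and a little over 3K words remain for the stack, in line with the figures quoted above.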

EVALUATION 

A 100 nanosecond clock was chosen for both the scanner and the parser.
In the scanner, between 20 and 50 micro-instructions are needed to process
each input character (assuming no delays due to the output buffers). Therefore,
the  scanner  requires  a new  input  character  every  2 to  5 microseconds.  This  is 
a modest  rate  for  an  interleaved  main  memory,  so  it  is  expected  that  the  FIFO 
input  buffer  will  always  be  full.  If  we  assume  an  average  of  10  characters 
per  token,  the  scanner  can  deliver  tokens  approximately  every  50  microseconds. 
This  is  an  effective  rate  of  20,000  tokens  per  second,  or  2500  eighty-character 
source  code  lines  per  second. 

A typical fast compiler running on a large computer will compile 10,000
source lines per minute. Since most compilers bottleneck on the scanner,
breaking this bottleneck will speed up the compiler so much that other components
can be coded in a less than optimal manner (e.g., in a high level
language  with  run-time  checks). 
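
The scanner rate quoted above follows from the clock period and the instruction counts. A short worked calculation, using only the figures in the text and taking the slower bound of 50 micro-instructions per character (the variable names are illustrative):

```python
CLOCK_NS = 100                 # clock period
INSTRUCTIONS_PER_CHAR = 50     # slower bound from the text (range is 20 to 50)
CHARS_PER_TOKEN = 10           # assumed average from the text
CHARS_PER_LINE = 80

ns_per_char = INSTRUCTIONS_PER_CHAR * CLOCK_NS          # 5000 ns = 5 microseconds
chars_per_second = 1_000_000_000 // ns_per_char
tokens_per_second = chars_per_second // CHARS_PER_TOKEN
lines_per_second = chars_per_second // CHARS_PER_LINE

# A fast software compiler at 10,000 source lines per minute.
SOFTWARE_LINES_PER_MINUTE = 10_000
speed_ratio = lines_per_second * 60 / SOFTWARE_LINES_PER_MINUTE

print(tokens_per_second, lines_per_second, speed_ratio)
```

Even at the slower bound, the scanner handles lines roughly fifteen times faster than the 10,000 lines per minute quoted for a fast software compiler.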

Following are the times calculated for each phase of the main parsing
loop (in nanoseconds):

   Get a token                   600
   Get the list for a state     1500
   Find the action               900 + 1600 x (# of entries searched)
   Transition to a new state     800
   Reduction                    4200

These  times  are  exclusive  of  any  delays  in  receiving  tokens  from  the  scanner, 
or  waiting  for  the  output  latch  to  be  emptied.  It  is  reasonable  to  assume 
that  the  average  list  has  ten  elements,  so  about  five  entries  will  be  searched 
to  get  a match.  There  are  approximately  as  many  transitions  as  reductions  in  a 
program  parse,  so  an  approximation  of  the  average  time  between  reductions  is 
600 + 1500 + 900 + 1600(5) + 800 + 4200 = 16000 nanoseconds, or 16 microseconds
per reduction. With this speed, it is probable that neither the
scanner nor the main processor will be able to keep up with the parser. We
have also freed the table space for the parser, so the compiler can run in
less space (5K in PASCAL) than if the tables were resident in main memory.
This would be even more important on small computers, where the address space
may be limited.
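
The 16 microsecond figure can be reproduced term by term. The sketch below takes the terms of the sum in the text at face value (600, 1500, 900, 1600 per entry searched, 800, and 4200 nanoseconds); the dictionary keys and the function name are illustrative only.

```python
# Phase times in nanoseconds, as they appear in the sum in the text.
PHASE_NS = {
    "get_token": 600,
    "get_list": 1500,
    "find_action_fixed": 900,
    "find_action_per_entry": 1600,
    "transition": 800,
    "reduction": 4200,
}


def ns_between_reductions(entries_searched=5):
    """Average time between reductions, assuming one transition per reduction."""
    p = PHASE_NS
    return (p["get_token"] + p["get_list"] + p["find_action_fixed"]
            + p["find_action_per_entry"] * entries_searched
            + p["transition"] + p["reduction"])


reductions_per_second = 1_000_000_000 // ns_between_reductions()
print(ns_between_reductions(), reductions_per_second)
```

At 16 microseconds per reduction the parser can sustain 62,500 reductions per second, which is the basis for the remark that neither the scanner nor the main processor is likely to keep up with it.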

Experience  with  the  University  of  Wisconsin  PASCAL  compiler  shows  that 
about 50% of the compilation time is spent in the software scanner and parser.
Thus,  we  conclude  that  if  such  a parsing  and  scanning  unit  were  added  to  a 
main processor, compilation could be speeded up by approximately a factor of

ABSTRACT


This paper presents a new distributed computer network structure
appropriate for a network of microprocessors. The new network
structure combines the advantages of a ring structure: simplicity, high
line utilization, concurrent service, distributed control information,
minimum delay for minimum cost, and high reliability. This is
accomplished  using  two  loops.  The  "inner"  loop  is  for  data  transfer. 
It  is  partitioned  into  N buses  interconnecting  N microprocessors. 

The  "outer"  loop  is  for  control  information  to  pass  along  under  the 
guidance  of  a bus  controller.  Results  for  simulations  of  contemporary 
proposals  (Pierce,  Newhall,  and  Reames  et  al.)  and  the  new  network 
proposed  in  this  paper  show  that  the  new  structure  substantially 
improves  throughput  when  compared  to  the  other  structures. 


INTRODUCTION: 


Researchers  have  proposed  distribution  of  low-cost  computing 
processors  throughout  a network  as  an  alternative  to  expensive  and 
highly  centralized  computer  systems  (SPAN  76).  The  results  have 
shown  that  completely  distributed  systems  lead  to  a great  deal  of 
inefficiency  due  to  increased  hardware  and  software  overhead  and 
often fail to deliver acceptable throughput as expected. In addition,
a computer network introduces other complexities concerning
deadlocks,  network  reliability,  traffic  regulation,  and  scheduling. 

This paper introduces a new network topology with a high throughput
rate for distributed computer systems. The network has an improved
response time, greater throughput, and is more reliable than
the Pierce, Newhall, or Reames-Liu loop network topologies.

I.  DESIGN  PHILOSOPHY 

A distributed computer system interconnects several heterogeneous
or homogeneous nodes which communicate with each other
through  network  media.  A heterogeneous  network  is  a collection  of 
architecturally  different  nodes  while  a homogeneous  network  is  a 
collection  of  architecturally  similar  processor  nodes. 

Farber  (FARE  72)  lists  the  motivations  to  develop  a distributed 
computer  system  as  any  or  all  of  the  following: 

1)  Modular  Growth 

2)  System  Reliability 

3)  Incremental  Upgrading  of  Processor  Nodes 

4)  Dynamic  Restructuring

5)  Decreased  Design  Time 

6)  Ease  of  System  Validation 

In  addition  we  include: 

7)  Tailored  Design  to  the  User's  Needs

8)  Better  Throughput  (Speed) 

9)  Less  Cost 




With  these  motivations  in  mind,  several  people  have  proposed 
and implemented a variety of network topologies in hopes of efficiently
managing distributed computer systems.

The topology of the interconnections in a network is of great
concern since it has a major effect on the performance of the distributed
system. The most highly connected network connects
every computer directly to every other. This involves N(N-1)/2
interconnections for N nodes and is very costly unless N is
very small. A less costly topology requiring N interconnections
and  an  additional  central  control  processor  is  the  star  configuration. 
The  central  control  computer  provides  node-to-node  interconnection 
by  switching  from  one  node  interconnection  pattern  to  another  upon 
demand.  Furthermore,  each  distributed  star  computer  system  can  be 
connected  to  another  star  computer  system  by  connecting  the  two 
central  control  computers  together,  and  with  appropriate  control 
algorithms, this will allow any node in either subnetwork to communicate
with any others.
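
The difference in link cost between the two topologies grows quickly with N. A brief illustration of the counts given above (the function names are illustrative):

```python
def full_mesh_links(n):
    """Links needed to connect every node directly to every other: N(N-1)/2."""
    return n * (n - 1) // 2


def star_links(n):
    """Links needed to connect N nodes to one central control processor."""
    return n


for n in (4, 16, 64):
    print(n, full_mesh_links(n), star_links(n))
```

For 64 nodes the fully connected network needs 2016 links against 64 for the star, at the price of the additional central processor.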

A problem with star network computer distribution is the reliability
of the system, for as soon as the central computer exhibits
faulty functions, the whole system breaks down. In addition, the
central control processor is an overhead cost added to the whole
system. If the number of nodes around the central processor is
small, then the advantage of this system is its speed. Also, because
the links between computers are bidirectional, the system has a
very good throughput. We will not include the star network in the
work  reported  here  because  of  its  poor  reliability  (STRE  76). 

Another  philosophy  is  to  connect  all  the  processor  nodes  in  a 
loop  or  ring  configuration.  This  is  called  a loosely  coupled 
connection  since  each  node  is  connected  to  others  by  only  two  links, 
an  input  link  which  comes  to  the  node  and  an  output  link  that  goes 
away  from  the  node.  Loop  systems  are  attractive  for  mini-micro 
computer  networks  due  to  their  possible  high  line  utilization  and 
because  they  are  simple.  This  last  philosophy  has  attracted  the 
attention of many researchers who have designed a variety of network
systems based on the simplicity of a loop. The new loop structure
will  allow  more  parallel  communications  between  nodes,  while  taking 
advantage  of  loop  simplicity. 

II.  PREVIOUS  LOOP  CONTROL  ALGORITHM 

The first loop structure system was suggested by NEWHALL
(FARM 69). In the NEWHALL loop a round-robin control passing
mechanism circulates around the loop and allows only one node at
a time to transmit one or more messages through the loop. Therefore,
the rest of the nodes have to wait and this causes a queuing
time  in  sending  the  messages  which  limits  the  achievable  loop 
utilization. 

A version  of  a loop  discipline  similar  to  the  NEWHALL  discipline 
is  allowed  with  IBM's  SDLC  (BONN  74)  (or  with  the  largely  equivalent 
HDLC  (DAVI  73)).  In  this  discipline  a central  controller  originally 
sends  a poll  command  around  the  loop.  The  first  attached  device 
wishing  to  transmit  is  thereby  enabled  to  transmit.  This  device 
then  ends  its  transmission  by  passing  the  poll  on,  so  that  control 
passes  around  the  loop  in  a manner  similar  to  the  behavior  of  a 
NEWHALL loop. This variation is not explicitly studied here because
of its similarity to the NEWHALL loop. On the other hand, Pierce
(PIER 72) introduced a new mechanism that improves network utilization
by time multiplexing the loop. That is, the information sent
around  the  loop  is  divided  into  fixed-size  packets  and  to  send  a 
message,  each  node  checks  for  an  empty  packet  before  transferring 
all  or  part  of  its  message.  If  a message  is  smaller  than  the 
fixed-size  packet,  the  excess  space  is  wasted.  If  the  message  is 
too  large  to  fit  the  packet,  then  the  message  is  broken  into  two 
or  more  packet-sized  messages.  When  a processor  node  transmits 
a message,  it  must  first  check  whether  the  next  packet  or  time 
slot  passing  by  it  is  empty.  If  it  is,  control  will  pass  to  the 
processor node's transmitter to see if there is any information to
be transmitted. In case the packet is not empty, the processor
node checks to see if the destination address in the packet matches
the node address. If so, the processor node transfers the packet
information  into  its  buffer.  If  the  packet  address  does  not  match 
the  processor  node  address,  then  the  processor  simply  passes  this 
packet  to  the  next  node.  The  transmission  mechanism  is  as  simple 
as  waiting  for  the  beginning  of  an  empty  slot  and  filling  it  with 
a packet,  but  disadvantages  of  this  system  include: 

a)  problem  of  dividing  messages  into  packets 

b)  problem  of  packet  reassembly  which  occurs  when  messages 
are  divided  into  packets  and  then  sent  separately,  so  a 
sorting  problem  arises. 

c)  messages  do  not  always  fit  into  a fixed  number  of  packets, 
so  there  are  some  partially  empty  packets  with  corresponding 
waste  of  network  capacity. 
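
Disadvantages (a) through (c) can be made concrete with a small sketch. It is only an illustration of fixed-size packetization, not Pierce's actual frame format: the sequence numbers, the zero padding, and the function names are assumptions made for the example.

```python
def split(message: bytes, size: int):
    """Divide a message into numbered fixed-size packets, padding the last."""
    packets = []
    for seq, start in enumerate(range(0, len(message), size)):
        chunk = message[start:start + size]
        packets.append((seq, chunk.ljust(size, b"\0")))  # padding is wasted space
    return packets


def wasted_bytes(message: bytes, size: int) -> int:
    """Unused capacity in the last, partially empty packet."""
    return -len(message) % size


def reassemble(packets, length: int) -> bytes:
    """Sort packets back into order and strip the padding."""
    return b"".join(chunk for _, chunk in sorted(packets))[:length]


message = b"x" * 250
packets = split(message, 100)
print(len(packets), wasted_bytes(message, 100))
```

A 250-byte message in 100-byte packets needs three packets and wastes 50 bytes of the third, and the receiver must sort the packets before the message can be used.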

Therefore, neither the Newhall nor the Pierce loop makes very efficient
use of loop topology. Reames and Liu (REAM 75) introduced a new
message  transmission  mechanism  called  DLCN  (Distributed  Loop  Computer 
Network)  which  allows  multiple  messages  in  the  loop  as  the  Pierce 
loop  does  and  messages  of  variable  length  as  the  Newhall  loop  permits. 

DLCN  incorporates  a variable  length  shift  register  before  each 
node’s  transmitter,  see  Figure  1.  A message  can  be  transmitted 
through  the  loop  whenever  no  other  message  transmission  is  already 
in  progress,  or  no  other  messages  have  started  passing  that  node. 

In  this  case,  the  variable  shift  register  provides  a delay  in  the 
incoming  message  equal  in  size  to  at  least  the  size  of  the  message 
to  be  inserted.  Once  an  incoming  message  has  been  delayed  in  this 
manner,  it  is  transmitted  ahead  of  any  incoming  messages  which  are 
in  turn  delayed  during  the  time  needed  to  transmit.  The  contents 
of the variable length shift register will gradually decrease in
length  and  finally  be  eliminated  if  there  is  not  enough  traffic. 

DLCN  actually  combines  Newhall  and  Pierce  loop  advantages  by 
allowing  simultaneous  message  arrival  with  message  transmission, 
and  also  provides  automatic  traffic  regulation  based  on  observed 
system  load,  but  DLCN  favors  infrequent  requests  while  delaying 
more frequent requests for network service.

A disadvantage of the DLCN is the complexity of the interface
mechanism and, therefore, the cost to build such an interface.
Secondly, inserting a variable shift register at each node lowers
the  reliability  of  the  overall  loop  since  it  adds  one  new  possible 
failure  mode.  Also,  when  the  number  of  nodes  in  the  loop  increases, 
eventually  the  queuing  time  will  increase  drastically.  This  limits 
the  number  of  nodes  inserted  in  a loop. 

Potvin  (POTV  71)  introduced  a generalized  distributed  computer 
system  called  the  star  ring  system.  It  combines  the  control  feature 
of  a loop  network  with  the  message  transmission  features  of  a star 
network.  It  is  somewhat  similar  to  Newhall's  technique  for  passing 
control  along  its  loop  and  in  its  method  of  time  multiplexing  message 
transmission.  The  system  is  restricted  by  the  number  of  nodes  on  the 
loop because the central star ring is common to all the nodes and,
therefore, not more than two nodes can talk to each other at any time.
This slows the throughput of the system by a great amount. Potvin
considers only a very small number of nodes in the network.

All the above communication loops suffer from the following
common shortcomings in addition to the problems discussed above.

1)  The stream of data is in one direction and therefore,
sometimes the transmission of data from one node to its
neighbor node takes place through the rest of the nodes
causing more delay and less reliability than necessary.

2)  If  a node  starts  sending  a stream  of  messages  to  another 
node  it  will  block  out  all  other  transmissions  and  network 
performance  will  decrease  by  a great  amount.  Thus,  the 
networks  mentioned  above  are  sensitive  to  local  demands 
that  affect  the  performance  of  all  nodes. 

3) If there are errors in the address fields of the message
and/or a node fails to function properly, messages will
saturate the loop in all the above systems. Several
different techniques have been used to recover from errors,
but these eventually slow down loop communication.

4) If there is a failure in the loop, the whole network will
fail to operate.

A new  experimental  loop  is  proposed  that  will  enable  the  whole 
network  to  recover  from  the  above  shortcomings.  The  concept  is  to 
distribute  data  and  control  into  two  different  loops  (a  data  loop 
and  a controller  loop).  The  data  loop  is  actually  a segmented 
loop  consisting  of  a single  segment  connecting  nodes.  Each  node 
is  interfaced  to  the  loops  by  a switch  that  may  be  turned  "on"  or 
"off".  The  control  loop  operates  by  a simple  arbiter,  which  accepts 
requests  for  communication,  decides  the  minimum  route,  and  sets  up 
the  data  paths  between  nodes  by  turning  appropriate  switches  "on" 
and  non-appropriate  switches  "off". 

III.  DESCRIPTION  OF  LOOP  NETWORK 

We suggest a modified loop network in which control messages
and data messages are transferred through two different communication
lines. This adds flexibility to the network for very little
increase in cost. The loop network system is configured from four
different components:

1)  control  line  loop 

2)  data  line  loop 

3)  processor  nodes 

4)  a special  processor  node  dedicated  to  line  control. 

The control line loop employs a polling technique to start
and stop the transfer of messages from a source node to a destination
node. Transmission is accomplished through a "double handshake"
where a request to send is followed by an acknowledgement
that the message has been received. In particular, there are two
different types of messages, SYN/ACK and Relay Control,
which can be sent over the control line.

SYN/ACK: When a node desires to communicate with another node
(SYN), or respond to end of communication (ACK), it will send
a message to the controller containing the address of the source
node and the address of the destination node along with the command
(either SYN or ACK) to be performed by the controller. Messages
of this type have the format shown in Figure 2(A).

Relay Control: These messages are sent from the controller to a source
or destination node to inform the node that a message is being sent to
it (destination) or that a message has been received by the
destination node (source), or to direct other nodes to position
their data switches to bypass the data and allow it to continue
along the data loop until reaching its intended destination. The
messages of this type are shown in Figure 2(B).

The  data  line  loop  transfers  all  the  data  messages  from  any 
source  node  to  any  destination  node  through  a minimum  route  which 
has  already  been  set  up  by  the  controller  as  explained  above.  The 
data  line  loop  illustrated  in  Figure  (3)  is  interfaced  to  each  node 
through  a three-way  switch  at  each  node  which  enables  the  node  to 
connect  segments  of  the  data  line  together  and  either  bypass  the 
node  or  connect  the  node  to  the  data  loop  so  that  the  node  can 
receive  or  send  data.  The  controller  sets  the  three-way  switches 
before each data transmission is allowed. For example, if Node 1
of Figure (4A) is to send data messages to Node 3, then the switches
and data segments are connected in one of the configurations shown
in Figure (4). Observe that connecting segments of the data loop
in this way uses only part of the entire data loop network (Figure (4)).
The remaining segments of the data loop are available for concurrent
data transmission between other nodes in the system. Therefore,
simultaneous transfer over non-interfering segments of the network is
quite possible. The combined effect of redundant alternate paths
and concurrent transmission over non-interfering segments of the
loop adds to the network reliability and throughput.
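
The switch settings and the test for non-interfering transfers described above can be sketched in Python. This is only an illustration under stated assumptions, not the authors' implementation: the node numbering (0 to n-1 in clockwise order), the segment numbering (segment i joins node i to node (i + 1) mod n), and all function names are hypothetical.

```python
def path_segments(src, dst, n, clockwise=True):
    """Segments used when travelling from src to dst in one direction.

    Assumption: nodes are numbered 0..n-1 clockwise, and segment i joins
    node i to node (i + 1) % n.
    """
    if clockwise:
        hops = (dst - src) % n
        return {(src + k) % n for k in range(hops)}
    hops = (src - dst) % n
    return {(src - 1 - k) % n for k in range(hops)}


def switch_settings(src, dst, n, clockwise=True):
    """Switch position per node on the path (assumes src != dst).

    The two end nodes connect to the data loop; every node between them
    bypasses the data so that it continues along the loop.
    """
    step = 1 if clockwise else -1
    settings = {src: "connect", dst: "connect"}
    node = (src + step) % n
    while node != dst:
        settings[node] = "bypass"
        node = (node + step) % n
    return settings


def non_interfering(segs_a, segs_b):
    """Two transfers can proceed together if they share no segment."""
    return segs_a.isdisjoint(segs_b)


# Six-node loop: 0 -> 2 and 3 -> 5 use disjoint segments.
print(switch_settings(0, 2, 6))
print(non_interfering(path_segments(0, 2, 6), path_segments(3, 5, 6)))
```

Nodes that do not appear in the returned settings are left untouched, so their segments stay free for other transfers.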

The  partitionable  loop  structure  described  above  is  a general 
structure.  In  addition  to  the  loop  topology  studied  here,  there 
is  also  the  potential  for  other  configurations.  The  topology  of  a 
specific  network  may  require  high-speed  transmission  between  two 
or  more  nodes,  depending  upon  the  needs  of  these  two  processor 
nodes.  In  such  a special  case,  it  may  be  expedient  to  include 
additional "express" buses to supplement the basic loop. This can
be done, for example, as shown in Figure (5), by merely increasing
the capability of control line switches at these nodes. In the
examples of Figure (5), supplemental data buses may be used to
establish high bandwidth communication between Node 1 and Node 4.
Alternatively, the response time of communication between Nodes 4
and 2 may justify an additional data line as shown in Figure (5B).

Processor nodes are configured from four elements:

A.  A node  control  mechanism  to  perform  data  loop  and  control 
loop  functions. 

B.  Control  switches  to  switch  the  data  lines. 

C.  Transmitter  and  receiver. 

D.  Terminal  processor  which  may  be  a simple  I/O  device,  a 
microcomputer,  or  an  interface  to  another  network. 

Figure (6) illustrates these four elements. Each node control
mechanism provides timing control, message detection, decoding and
encoding of messages, control of the data switches, transmitter
and receiver control, and communication with its node terminal.

The control switch is a modular unit easily extendable through
hardware changes; for instance, a control switch can control two
data segments along with the receiver and transmitter. If the
number of data segments interfaced to the node increases, the complexity
of the switches will increase in a modular manner.

A simple transmitter-receiver can be time multiplexed, or the
transmitter and receiver can be separated by using separate channels,
which adds complexity to the control switches. Figure (7) shows both
a simple and a more complex transmitter-receiver section.

The  loop  network  interface  is  designed  as  an  "intelligent 
interface"  so  that  no  assumption  about  the  processor  terminal  is 
needed.  Any  device  may  be  plugged  into  the  loop  network  regardless 
of  its  sophistication.  All  the  control  needed  for  any  terminal  to 
talk  to  the  receiver-transmitter  section  is  provided  by  the  node 
control, thus allowing terminals to be of any type. The intelligence
of the node controller is easily provided by a low-cost
microprocessor and PROM.


The loop controller functions are as follows:

A. Sends and receives control messages to and from the control
line.

B. Schedules node communications.

C. Finds the minimum path between the nodes which are to
communicate.

D. Provides a timing mechanism.

The control messages have the formats of Figure 2(A) or Figure
2(B). The controller decodes or encodes them with the appropriate
timing. Scheduling of nodal communication may be by any scheduling
algorithm, such as LIFO, FIFO, round robin, or shortest-messages-first.
For the routing algorithm, any method can be considered, but since
all the needed information is within the controller, routing can be
tailored to special applications of the network. The timing mechanism
can be part of the controller's function to synchronize all
the nodes, or it can be varied in each individual node. Therefore,
nodes can work synchronously or asynchronously. The function of
the controller is flowcharted in Figure (8). The functions of the
network  controller  are  very  straightforward  and  can  be  performed 
by  any  node  in  the  network.  We  will  assume  a special  control  node 
microprocessor  is  used  to  perform  the  control  functions  for  the 
entire  network.  In  the  comparisons  to  follow,  we  will  include  this 
special-purpose control node as an overhead cost, but it should be
pointed  out  that  the  control  functions  required  by  the  proposed 
loop  can  be  carried  out  by  any  node.  In  terms  of  reliability,  this 
means  that  failure  of  the  control  node  does  not  imply  failure  of  the 
entire  network,  because  control  can  be  passed  to  another  (working) 
node  on  the  loop. 

IV.  SIMULATION  RESULTS 

We  modeled  our  simulation  study  after  the  work  of  Reames  and 
Liu  (REAM  75).  They  simulated  the  DLCN  (Distributed  Loop  Computer 
Network),  Newhall  Loop,  and  Pierce  Network.  The  results  obtained 
in our study will be compared with their results. Our results
will extend their results to provide an evaluation of all four
network topologies. In the DLCN simulation model, the length of
the shift register interface to the loop was 512 characters. For
the Pierce model, Reames and Liu selected a packet size of 36
characters. This is an optimal packet size obtained by minimizing
the product of average number of packets times the packet size.

In  the  Newhall  network,  they  simulated  passing  the  control  token 
only  when  the  queue  of  messages  in  that  node  is  empty  instead  of 
passing  one  message  at  a time  at  each  node.  This  produces  a shorter 
total  message  transmit  time  for  the  Newhall  network. 

For all of the systems simulated by Reames and Liu, message
length has a truncated negative exponential distribution with a
mean of 50 characters, a minimum of 10, and a maximum of 512 characters,
of which the first nine characters are control characters. Message
arrivals obey the Poisson distribution, and the number of nodes
is 6.
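
A workload with these statistics can be generated as in the following Python sketch. It is a hedged illustration only: the text does not say how the exponential distribution was truncated, so rejection sampling is an assumption here, and the function names and the use of a seeded generator are hypothetical.

```python
import random

MEAN_LEN, MIN_LEN, MAX_LEN = 50, 10, 512   # characters, as stated in the text


def message_length(rng):
    """Draw a message length from a truncated negative exponential.

    Assumption: "truncated" is read as rejection sampling, so draws outside
    [MIN_LEN, MAX_LEN] are discarded. The mean of the accepted values is
    then closer to 60 than to 50, because the short draws are rejected.
    """
    while True:
        x = rng.expovariate(1.0 / MEAN_LEN)
        if MIN_LEN <= x <= MAX_LEN:
            return int(round(x))


def arrival_times(rng, rate, count):
    """Poisson arrivals: exponentially distributed gaps at the given rate."""
    t, times = 0.0, []
    for _ in range(count):
        t += rng.expovariate(rate)
        times.append(t)
    return times


rng = random.Random(1)
lengths = [message_length(rng) for _ in range(5)]
arrivals = arrival_times(rng, rate=0.01, count=5)
print(lengths)
print(arrivals)
```

If the original study instead shifted or clipped the distribution to obtain a mean of exactly 50, only `message_length` would need to change.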

For the new experimental loop, the message length and message
arrival statistics and the number of nodes are the same as above.
There is no need for control messages along with data in this new
system. For reliability purposes, we used the same number of characters
by including control characters with the data. The messages
in this system can be of any length without hardware or software
constraints.

The scheduling algorithm is simple FIFO, and the routing algorithm
is simply to find the minimum path between two nodes in
either direction. If two paths have the same length, the clockwise
direction is arbitrarily chosen. The new loop network improves
throughput when employing these simple algorithms for scheduling
and routing.
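
The routing rule and FIFO scheduling just described can be sketched as follows. This is a minimal Python illustration rather than the simulator used in the study; the node numbering (0 to n-1 in clockwise order) and the names `minimum_route` and `serve_fifo` are assumptions.

```python
from collections import deque


def minimum_route(src, dst, n):
    """Return (direction, hops) for the shorter way round an n-node loop.

    Assumption: nodes are numbered 0..n-1 in clockwise order.
    Paths of equal length go clockwise, as the text specifies.
    """
    cw = (dst - src) % n     # hops in the clockwise direction
    ccw = (src - dst) % n    # hops in the counterclockwise direction
    if cw <= ccw:
        return ("clockwise", cw)
    return ("counterclockwise", ccw)


def serve_fifo(requests, n):
    """Serve (src, dst) requests strictly in arrival order."""
    queue = deque(requests)
    served = []
    while queue:
        src, dst = queue.popleft()
        served.append((src, dst) + minimum_route(src, dst, n))
    return served


for entry in serve_fifo([(0, 3), (1, 5), (4, 2)], 6):
    print(entry)
```

On a six-node loop the request from node 0 to node 3 is the tie case: both directions need three hops, so the clockwise path is taken.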

Table 1 shows the average interarrival rate, data line usage,
waiting time for each message to be transmitted, transmission time,
total transmission time, and control line usage for the new experimental
network, as well as for the other three networks. Figure (9)
shows the variation of mean total message transmission time versus
mean arrival rate for all four networks. Figure (10) shows the
changes in line utilization versus changes in interarrival rate
for all systems, which indicates the load of the system, and finally
Figure (11) is a graph of mean control line utilization versus the
mean interarrival rate for the new experimental system only.

From Table 1, we see that the Pierce and Newhall loops and the new
experimental loop have an almost constant transmission time for any
load on the system (46 time units per packet for the Pierce loop,
63 time units per message for the Newhall loop, and 52 time units for
the new experimental loop). This is due to a constant delay in
the transmission lines for the Pierce and Newhall systems. For the
new experimental loop, there is no delay in the transmission line;
transmission time is equal to the transfer time of the characters
in a message. For DLCN, message transmission time is variable:
as the arrival rate increases (that is, as the system load
increases), the shift register delay line time increases,
leading to an increase in transmission time proportional to system
load. On the other hand, the queuing time at each node will not
increase  as  fast  as  transmission  time  since  whenever  a message  is 
ready  to  go  in  the  loop,  the  node  will  insert  the  variable  delay 
shift  register  in  the  loop  and  then  the  message  does  not  have  to 
wait  longer.  This  explains  why  DLCN  is  faster  than  the  Pierce  and 
Newhall  loops.  The  superior  performance  of  the  new  experimental 
loop  is  due  to  multiple  concurrent  transmission,  variable  message 
length  without  any  additional  hardware  or  software  overhead,  and 
the  ability  to  select  the  shortest  path  from  the  bidirectional 
segments of the loop. As we see from Figure (9), total transmission
time for Newhall, DLCN, and the new experimental loop is the same
for very low system load. But as soon as the load on the system
goes higher, the total transmission time for the new experimental
loop shows improvement over the others. In the Pierce loop, a
message always has a mean wait equal to one-half of the packet
size and must then be transmitted in several packets. For this
reason, the Pierce loop cannot compete with the others for low
system loads. As soon as the system load goes higher, the Pierce
loop exhibits concurrency (simultaneous packets on the loop) and
its performance improves over the Newhall loop, which shows its
inherent serial nature leading to poorer performance.

In our new experimental loop there is a minimum queuing time
for SYN/ACK and relay control messages. For low loading of the
network this overhead shows up as a significant part of the overall
delay, but since these two control messages cause a constant
average delay they contribute a smaller proportion of the delay as
the network load increases. Typically messages are queued before
being transmitted and the delay due to control messages is overlapped
with the fixed control message's queuing time.

The  greatest  advantage  of  the  new  network  is  that  segments  of 
the  loop  can  be  activated  simultaneously.  The  added  concurrency 
of  the  new  loop  explains  its  increased  throughput  when  compared 
with the other networks. From Figure (10) we see the mean line
utilization is very low for all the networks. As system load increases,
the line utilization for the Newhall network levels off at about
50 percent. For the Pierce and DLCN systems, line utilization increases
as system load increases. However, when the loop is utilized up
to its maximum, the waiting time will increase drastically. The
proposed network requires nearly half of the line utilization of
the other loops simulated. Figure (11) shows a linear relationship
between the mean control line usage and system load. This is due
to the constant delay for SYN/ACK messages; the relay control
message is of variable length, but the changes are within 7 percent.

CONCLUSION: 

The  main  goal  of  this  work  was  to  improve  the  throughput  of  a 
microcomputer  network  using  a flexible,  simple,  and  reliable  loop 
topology. 

The results of our simulation have shown that completely decentralizing
microcomputers leads to a decrease in throughput compared
to the expected throughput of n processors. The loss in
throughput resulting from networking multiple processors can be
partially compensated for by careful design of the network and its
interfaces. Reliability can be achieved by permitting any node to
take over the controller's job.

The  hardware  implementations  given  for  the  interface  and  line 
controller  show  compatibility  of  this  system  with  microprocessor 
technology.  Because  microprocessors  are  low  cost,  this  type  of 
network  can  be  constructed  inexpensively. 

Future research in this area will be done using different
scheduling algorithms for the controller, using a different number
of nodes, and with different types of loop structures. Also, an
investigation of a mathematical model for such a loop structure,
as has been done in the past for other loop structures
(SPRA 72), (HAYE 74), (KONH 72), and (KAYE 72), is needed.
Abstract

We  describe  a computer  architecture  for  parallel  processing  centered 
around  the  parallel  evaluation  of  recursive  programs  by  simultaneously 
performing  as  many  subtasks  as  possible.  The  computer  system  is  based 
on the parallel execution of LISP programs as a collection of asynchronous
subtasks. The design can be viewed as a multi-instruction, multi-data
(MIMD)  data-flow  architecture  which  is  extensible  in  the  sense  that 
processors  can  be  added  to  or  deleted  from  the  system  without  disturbing 
the  overall  design.  The  processors  function  independently  using  a shared 
central  memory.  We  intend  to  develop  a simulator  for  the  system  in  order 
to  collect  statistics  on  the  speed-up  attainable  with  this  architecture 
as  a function  of  the  number  of  processors  available  and  the  kind  of  tasks 
being  performed. 
1. Introduction


1.1 General. LSI technology has provided us with large numbers of inexpensive
microprocessors. In order to develop the full potential of this technology,
we must move beyond the concepts of sequential control for programs and
centralized control for computer systems. Because we are reaching the technological
limits of processor speeds, parallelism and distributed processing
appear to be the techniques by which the speed of computation will be increased.
Extensive use of microprocessors in the design of computer systems makes it
possible to rank CPU components among the relatively cheap parts of the
system. This calls for a new viewpoint in design and makes the consideration
of radically new and different architectures worthwhile. The computer system
we will describe uses a large number of microprocessors working in an
asynchronous mode in order to attain a significant degree of parallelism
in the computations being performed. Furthermore, the system is able to
take advantage of parallel computation without requiring programmers to
use new and specialized techniques in designing algorithms for the system.

This latter point is significant because the cost of software has
continued to increase while reliability of software has become a very serious
problem. If this is true for sequential programs, which we have been coding
since the advent of computing, the problems of reliability and production
costs will be even more prevalent in programming parallel systems where
the programmer must tailor his algorithms to a specific parallel machine
organization in order to take full advantage of the system. Currently,
there are very few languages available which support parallel programming.
Parallel notation is still at a primitive level. A notation for parallelism
in algorithms which is independent of particular machines is desirable. A
notation which doesn't require the programmer to explicitly determine the
parallelism is even more desirable. Then the programmer can turn his
attention to the proper design of the program and leave it to the computer
system to discover parallelism in the algorithm. Since purely recursive
programs make this possible, we want to investigate the degree of speed-up
attainable through parallel evaluation of recursive programs.

We propose to study the attainable speed-up through a detailed simulation
of a computer system which directly evaluates recursive programs. The
structure of the system is similar to the data flow architectures of Dennis
[Dec, 1974; May, 1975], Rumbaugh [1977], and Gostelow and Arvind [1976].
Among  the  results  of  this  simulation  will  be  data  on  the  speed-up  attainable 
using  such  architectures  on  various  classes  of  problems  (for  example,  solving 
systems  of  equations,  differential  equations,  sorting,  searching,  etc.)  as 
a function  of  both  the  number  of  inputs  (problem  complexity)  and  the  number 
of  microprocessors  in  the  computer  system  (system  complexity).  As  a step 
towards  a comprehensive  evaluation  of  data  flow  architectures,  the  results 
of  such  a simulation  would  be  valuable  to  scientists  doing  research  in 
data  flow  architectures  specifically  as  well  as  to  those  interested  in 
other  aspects  and  viewpoints  in  the  design  of  parallel  systems. 

We  intend  to  make  the  simulation  extensible  to  other  related  architectures. 
In  particular,  the  system  we  will  study  deals  with  the  problem  of  a central 
memory  accessible  by  many  microprocessors  in  an  unrealistic  way.  (It 
assumes  memory  conflicts  only  occur  when  more  than  one  microprocessor  attempts 
to  access  the  same  word  in  memory  during  the  same  memory  cycle  - that  is, 
single-word  memory-banks.)  In  the  future,  we  want  to  be  able  to  extend  our 
simulation  to  handle  other  hypotheses  about  the  organization  of  central 
memory. 

If  results  warrant  it,  we  may  be  able  to  implement  a data  flow  computer 
system  using  network  microprocessor  facilities  in  our  laboratory  at  Colorado 
State  University.  (However,  at  this  writing,  funding  for  these  facilities 
is  still  pending.) 

As a first step towards an eventual simulation, we will describe a
machine for parallel evaluation of recursive programs.

1.2 Underlying ideas. A safe computation rule for a program is one which
guarantees  completion  of  the  computation  if  completion  is  possible  at  all. 
There  are  several  well-known  safe  computation  rules  for  recursive  programs 
[Manna,  1974].  Many  of  them  (e.g.  full  substitution)  automatically  provide 
a scheduling  algorithm  for  evaluating  subcomputations  in  parallel.  Our 
simulation  will  quantitatively  estimate  the  decrease  in  time  required  for 
execution  of  recursive  programs  when  a number  of  processors  are  available 
to  perform  subcomputations  simultaneously.  Specifically,  we  will  chart 
the  percentage  of  speed-up  obtainable  by  parallel  evaluation  of  recursive 
programs  as  a function  of  the  number  of  processors  available.  There  will 
be  a different  chart  for  every  program  but  by  choosing  the  programs  over 
a wide  range  of  typical  computations,  we  will  present  a general  picture  of 
the  average  speed-up  attainable  for  different  types  of  computations. 
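
The kind of chart proposed here can be illustrated with a toy Python model. It is a sketch under strong simplifying assumptions and is not the simulator the paper proposes: the workload (the call tree of a naive recursive Fibonacci), the unit cost per call, the omission of the step that combines results, and the absence of memory or queue conflicts are all hypothetical choices made for the example.

```python
from collections import deque


def subtasks(n):
    """Subtasks spawned by the naive recursive call fib(n)."""
    return [n - 1, n - 2] if n >= 2 else []


def time_slices(n, processors):
    """Count the time slices needed to run every call in the fib(n) call tree."""
    queue = deque([n])
    slices = 0
    while queue:
        # Each available processor takes one waiting task in this slice.
        started = [queue.popleft() for _ in range(min(processors, len(queue)))]
        for task in started:
            queue.extend(subtasks(task))
        slices += 1
    return slices


def speed_up(n, processors):
    """Ratio of one-processor time to the time with the given processor count."""
    return time_slices(n, 1) / time_slices(n, processors)


for p in (1, 2, 4, 8):
    print(p, round(speed_up(10, p), 2))
```

In this model the speed-up levels off once the processor count exceeds the widest level of the call tree, which is the sort of program-dependent behavior a per-program chart would show.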

We  want  to  emphasize  that  we  are  interested  primarily  in  the  speed-up 
attainable  without  regard  to  memory  problems.  There  is  no  doubt  that  these 
parallel  evaluation  schemes  for  recursive  programs  will  require  more  memory 
for  execution  than  current  evaluation  schemes.  (It  is  difficult  to  say 
how  much  more,  but  that  figure  will  be  another  result  of  the  study.) 
Further,  to  achieve  the  greatest  increase  in  speed  we  will  use  a memory 
in  which  the  words  are  banked  individually.  That  is,  we  will  assume  that, 
in  at  least  part  of  the  available  memory,  no  addressing  conflicts  will 
occur  unless  two  processors  simultaneously  attempt  to  address  the  same 
word. Such a multi-port memory would be exorbitantly expensive with
current  technology  (indeed,  the  trend  is  toward  larger  banks,  not  smaller 
ones),  but  future  technologies  may  solve  this  problem.  In  any  case,  we 
are  interested  in  the  speed-up  attainable  under  the  best  memory  conditions. 
If  programs  run  significantly  faster  with  these  evaluation  schemes,  then  we 
can  attack  the  memory  problems. 


1.3  Relationship  to  previous  work.  Work  in  parallel  processing  has  been 
given  a tremendous  impetus  by  recent  improvements  in  LSI  technology.  CPU's 
are  no  longer  the  most  expensive  components  of  computing  systems.  On  the 
contrary,  they  are  among  the  cheapest.  It  is  clear  that  a new  strategy  for 
designing  computing  systems  is  called  for,  and  there  are  many  avenues  which 
should  be  investigated.  All  researchers  in  the  field  recognize  this  need. 

The  avenue  we  are  exploring  is  closely  related  to  current  research  on 
data flow architectures [Dennis, 1974; Rumbaugh, 1977; Gostelow and Arvind,
1976]. In fact the results of our study could be interpreted as an evaluation
of one type of data flow architecture. However, our proposed research
differs from current work on data flow architectures in two important ways.
First, our machine is designed directly around an existing and well studied
programming language, LISP 1.5. This avoids the pitfalls of designing
new programming languages for new machines. In addition, the use of LISP
avoids requiring the programmer to explicitly discover parallelism in the
computation his program implements. This could be an important advantage
[Glushkov, et al, 1974]. Other efforts in this general direction are Weng's
design  of  a stream-oriented  data  flow  language  [Weng,  1975],  and  the 
important  work  of  Kuck,  Kogge  and  others  in  the  translation  of  FORTRAN 
and  ALGOL-like  programs  into  programs  for  computer  systems  with  a capacity 
for  a high  degree  of  parallelism  [Kuck,  et  al,  1947;  Kogge,  1974;  Ramamoorthy, 
1969;  Stone,  1967].  We  hope  that  our  efforts  will  complement  theirs  and 
provide  another  set  of  alternatives  for  future  researchers  in  the  field. 

A second  way  in  which  our  approach  differs  from  current  work  on  data 
flow  architectures  is  in  the  nature  of  the  parallelism.  Whereas  most  data 
flow  schemes  require  a complete  set  of  inputs  before  initiating  a process 
stored  at  a node,  subtasks  in  our  programs  are  initiated  as  soon  as  they 
are  encountered.  This  has  important  advantages  in  computing  partial 
recursive  functions.  Not  only  is  more  concurrency  allowed  in  some  cases 
(namely, when a function doesn't need all its arguments to complete its
computation)  but  it  also  provides  a safe  computation  rule.  In  essence, 
our  machine  simulates  the  full  substitution  method  of  evaluation  of  recursive 
programs.  This  rule  guarantees  that  a program  will  terminate  on  the  widest 
possible  range  of  inputs  [Manna,  1974]. 

An  important  motivation  for  studying  the  execution  characteristics  of 
a recursive language is the feeling in the computing community that languages
like pure LISP (i.e., variableless, recursive languages) encourage the
production of well-designed, reliable, verifiable programs [Landin, 1966;
Noonan and Pantor, 1974; Burstall, 1969; McCarthy, 1962; and Glushkov,
et al, 1974]. If these perceptions are accurate, it may be possible to
speed  up  the  execution  of  programs  (via  parallelism)  and  increase  program 
reliability  at  the  same  time  by  using  recursive  programming  techniques. 

An  additional  advantage  of  basing  our  system  on  LISP  is  the  existence  of 
a large  number  of  programs  already  written  which  may  be  used  in  gathering 
statistics on the speed-up attainable with our computer system. Our primary
goal is to help establish the advantages of certain architectures for
easily  programmable  computers  capable  of  extensive  parallel  processing. 

2.  Multiprocessor  Architecture 

There are four principal components of the computer system: (1) a
supervisor which maintains the list of available processors and assigns
processors to tasks, (2) a collection of independent, identical processors,
(3) a queue of tasks waiting to be performed (first-in, first-out), and
(4) a central memory.

2.1.  The  supervisor.  The  supervisor  is  responsible  for  monitoring 
the  task  queue.  If  a task  is  waiting  to  be  processed  and  a processor  is 
available,  the  supervisor  removes  the  task  from  the  queue,  removes  the 
processor  from  the  roster  of  available  processors,  and  assigns  the  task 
to  the  processor.  The  processors  themselves  run  asynchronously. 


2.2  The  processors.  The  computer  may  have  any  number  of  processor 
components,  all  of  which  are  identical.  Each  processor  has  the  following 
characteristics. 

1.  It  can  perform  any  system  function. 

2.  It  can  correctly  scan  any  compiled  program. 

3.  It  can  access  central  memory.  The  four  modes  of  memory  access  are: 

(a)  Store  value  at  given  address, 

(b)  Fetch  value  from  given  address, 

(c)  Pull  an  address  off  of  the  free-space  list  (garbage  allocation). 

(d)  Return  an  address  to  the  free-space  list  (garbage  collection). 

4.  It  has  a local  queue  to  keep  its  place  in  scanning  compiled  programs. 

5.  It  has  a small,  local,  random  access  memory. 

2.3.  The  task  queue.  During  any  single  time  slice  the  supervisor  may 
remove  at  most  one  task  from  the  queue.  In  addition,  during  the  same  time 
slice,  one  or  more  processors  may  attempt  to  place  a task  on  the  queue.  Only 
one  addition  to  the  queue  may  be  made  during  a single  time  slice.  To 
resolve conflicts, processors are numbered sequentially, the processor
with the lowest number having the highest priority in placing tasks on the
queue. Thus, conflicts are resolved in favor of the processor with the
lowest number.
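
The queue rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `run_time_slice`, the dictionary of pending requests, and the choice to perform the removal before the addition within a slice are ours, not part of the design described here.

```python
from collections import deque


def run_time_slice(queue, requests, supervisor_ready):
    """Simulate one time slice of the global task queue.

    queue            -- deque of tasks, oldest first (first-in, first-out)
    requests         -- dict {processor_number: task} of attempted additions
    supervisor_ready -- True if the supervisor has an idle processor to assign

    Returns (removed_task, deferred_requests).
    """
    removed = None
    if supervisor_ready and queue:
        removed = queue.popleft()            # at most one removal per slice

    deferred = dict(requests)
    if deferred:
        winner = min(deferred)               # lowest number has highest priority
        queue.append(deferred.pop(winner))   # at most one addition per slice
    return removed, deferred
```

Processors whose requests are deferred would simply try again in the next time slice.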

2.4.  The  central  memory.  The  words  in  central  memory  are  organized  as 
single-word  banks.  Memory  conflicts  only  arise  when  two  processors  attempt 
to  gain  access  to  the  same  word  in  the  same  time  slice.  As  with  the  task 
queue,  conflicts  are  resolved  in  favor  of  the  processor  with  the  lowest 
number. Because of the single-word banks, the memory and its addressing
hardware are the most complicated component of the entire system. This is
typical of multiprocessor systems of this type [Glushkov, 1974]. Increasing
the size of the banks would reduce this complexity, but it would also
degrade performance, since processors addressing different words in the
same bank would then conflict. Our goal is to see what kind
of performance is possible with a complex central memory before attacking
the problem of reducing the memory's complexity.

Another  potential  memory  conflict  arises  when  two  or  more  processors 
attempt  to  pull  (or  put)  an  address  from  (or  onto)  the  free-space  list.  To 
alleviate  these  conflicts  each  processor  is  allocated  its  own  free-space 
list  by  the  supervisor.  If  a processor's  free-space  list  becomes  empty, 
the  supervisor  is  notified  and  allocates  additional  free-space  to  the 
processor.  Again,  we  assume  the  best  of  all  plausible  worlds  with  respect 
to  the  memory  structure  of  our  computer  system. 
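
A minimal sketch of the per-processor free-space lists follows. The class names, the `chunk` size, and the lazy refill (the processor asks for more space on its next pull rather than notifying the supervisor the moment its list empties) are simplifying assumptions made for illustration.

```python
class Supervisor:
    """Owns the unassigned part of central memory (hypothetical model)."""

    def __init__(self, addresses, chunk=4):
        self.pool = list(addresses)
        self.chunk = chunk

    def allocate(self):
        """Hand out the next block of free addresses to one processor."""
        if not self.pool:
            raise MemoryError("central memory exhausted")
        block, self.pool = self.pool[:self.chunk], self.pool[self.chunk:]
        return block


class Processor:
    """Keeps a private free-space list, so pulls never conflict."""

    def __init__(self, supervisor):
        self.supervisor = supervisor
        self.free = supervisor.allocate()

    def pull(self):
        """Memory access mode (c): take an address off the free-space list."""
        if not self.free:
            self.free = self.supervisor.allocate()
        return self.free.pop()

    def give_back(self, address):
        """Memory access mode (d): return an address to the free-space list."""
        self.free.append(address)
```

Because each processor touches only its own list, the only shared step is the occasional call to the supervisor.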

3.  Programs  for  the  Computer  System 

3.1  Source  programs.  Source  programs  will  be  written  in  a subset  of  LISP 
1.5 [McCarthy et al., 1965]. A program will be a sequence of function
definitions followed by a sequence of function references. To write
user-defined functions, we will use the pseudo-function DEFINE. Control within
a function will be provided by the special form COND and by function
references nested inside a function definition (including recursive
references). There will be no other control mechanisms. It would be easy to
add  compatible  control  structures  like  AND  and  OR.  However,  assignments 
and  sequential  controls  will  not  be  added  because  the  use  of  variables  violates 
one  of  our  basic  assumptions  (i.e.,  absence  of  side-effects). 

Because the source programs will be compiled, rather than interpreted,
we do not allow the function EVAL. This avoids calling in the compiler at
execution time. Similarly, QUOTE is only an instruction to the compiler
(telling it to set up a data item in memory) and is not a system function.

To summarize, source programs will be written in a subset of LISP
1.5. The subset includes the pseudo-function DEFINE and the special form
COND. The special forms LAMBDA and QUOTE may only be used in trivial ways:
LAMBDA in DEFINE clauses and QUOTE for describing data to the compiler.
For convenience in writing programs, the special form LIST will also be
implemented and translated by the compiler into a series of references
to CONS.
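
The translation of LIST into references to CONS might look like the following sketch. It assumes S-expressions are held as nested Python lists with atoms as strings or numbers, and that the empty list is written NIL as in LISP 1.5; the function name `expand_list` is hypothetical.

```python
def expand_list(form):
    """Rewrite every (LIST a b ...) form as nested CONS references."""
    if not isinstance(form, list) or not form:
        return form                      # atoms and () are left alone
    if form[0] == "QUOTE":
        return form                      # quoted data is not code; do not rewrite
    items = [expand_list(item) for item in form]
    if items[0] == "LIST":
        result = "NIL"
        for arg in reversed(items[1:]):  # build from the last argument outward
            result = ["CONS", arg, result]
        return result
    return items
```

For example, `(LIST A B)` becomes `(CONS A (CONS B NIL))`, so the compiled program needs no system function beyond CONS.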

3.2 System functions. System functions may be thought of as machine
language instructions. Each processor can directly execute a system
function, given legitimate arguments.

We will write the simulator so that the set of system functions
can be easily changed. In this way we will be able to study the
effectiveness of different collections of elementary operations.

Our current plan is to start with the set of elementary functions
shown in the Table of System Functions. They are taken directly from LISP
1.5. Other system functions can be added without difficulty if it seems
warranted, but the ones listed here are sufficient in a theoretical sense.


Table of System Functions

Function   Arguments                  Value                                          Timing

CAR        non-empty list             first element of arg                           1
CDR        non-empty list             arg with first element deleted                 1
CONS       list or atom, list         2nd arg with 1st arg inserted at beginning     1
NULL       list or atom               true iff arg is the empty list, else false     1
NOT*       true or false              false or true                                  1
ATOM       list or atom               true iff arg is an atom, else false            1
EQ         non-numeric atom or list,  true iff both args are the same atom,          1
           non-numeric atom or list   else false
+          numeric atom,              sum of args                                    20**
           numeric atom
-          numeric atom,              arg 1 minus arg 2                              20
           numeric atom
*          numeric atom,              arg 1 times arg 2                              100
           numeric atom
/          numeric atom,              arg 1 divided by arg 2                         300
           numeric atom
<, <=, =,  numeric atom,              true iff indicated numeric relationship        20
<>, >, >=  numeric atom               holds
ENTIER     numeric atom               arg converted to integer by truncation         20
LOG        numeric atom               natural logarithm of arg                       1000
EXP        numeric atom               exponential of arg                             1000

* The  atoms  true  and  false  are  special  system  constants  denoted  T and  F 
in  source  programs.  The  empty  list  will  be  treated  as  non-atomic. 
Therefore,  NOT  and  NULL  are  different  functions. 


** Since timing in the simulation will be table driven, these timing
figures can be easily changed. In this table the timing for numerical
operations reflects our assumption that component processors will be
microprocessors with a small word size and no built-in floating point
operations. Numerical operations, if implemented in software or firmware,
would be relatively slow as shown here. With more sophisticated
component processors, these figures would be substantially reduced.
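
As an illustration of what "table driven" timing could mean in a simulator, the sketch below keeps the timings in a dictionary that can be swapped without touching the cost routine. Only a subset of the table is reproduced; the names `TIMING` and `serial_cost` are ours, and the `"/"` key stands for the division entry.

```python
# Time units per system function, copied from a subset of the table above.
TIMING = {
    "CAR": 1, "CDR": 1, "CONS": 1, "NULL": 1, "NOT": 1, "ATOM": 1, "EQ": 1,
    "+": 20, "-": 20, "*": 100, "/": 300,
    "ENTIER": 20, "LOG": 1000, "EXP": 1000,
}


def serial_cost(operations, timing=TIMING):
    """Total time for one processor to execute the operations in sequence."""
    return sum(timing[op] for op in operations)


# A hypothetical faster processor only needs a different table.
FAST_MULTIPLY = {**TIMING, "*": 10}
```

Changing the assumed hardware then amounts to passing a different table, for example `serial_cost(ops, FAST_MULTIPLY)`.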


3.3  Compiled  programs.  At  execution  time  all  user  programs  will  be 
represented  as  tree  structures.  Function  references  will  be  by  numeric 
address  of  the  code,  data  references  by  numeric  address  of  the  storage 
location,  and  argument  references  by  indirection  through  an  argument  table 
stored  in  the  local  memory  of  the  processor  performing  the  function.  The 
compiler will translate source programs into the appropriate tree structure
(see Fig. 1), replacing all symbolic references to functions, parameters, and
data by numeric addresses. Each address in the compiled code will be flagged
as to type: function reference, data reference, or parameter reference.

By compiling programs in this way, we delegate the need for associative
memory to the compiler, and it can be handled there in the traditional way
(i.e., by keeping a symbol table). Avoiding the need for fast associative
memory at execution time is an important way to reduce the cost of this
system. Other researchers [Glushkov, 1974] have found associative memory
necessary in their designs of computer systems of this type.
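
The symbol-table step can be sketched as follows. This is a simplified illustration, not the compiler itself: the function name `resolve`, the flag strings, the use of list positions as data addresses, and the example addresses are all assumptions.

```python
def resolve(expr, functions, params, data):
    """Replace symbolic references by flagged numeric addresses.

    expr      -- nested list such as ["+", 1, ["LENGTH", ["CDR", "A"]]]
    functions -- symbol table: function name -> code address or op-code
    params    -- list of dummy-parameter names of the function being compiled
    data      -- list of constants; a constant's address is its position here
    """
    if isinstance(expr, list):                        # function reference
        head, *args = expr
        return ("function", functions[head],
                [resolve(a, functions, params, data) for a in args])
    if expr in params:                                # parameter reference
        return ("parameter", params.index(expr))
    if expr not in data:                              # data reference
        data.append(expr)
    return ("data", data.index(expr))
```

After this pass no names remain in the compiled tree, so nothing at execution time needs to look a symbol up.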

Because  the  compiled  programs  are  tree  structures  and  because  operations 
have  no  side  effects,  all  code  is  reentrant.  In  fact,  the  code  for  a 
function  may  be  entered  at  any  node,  thus  allowing  partially  executed 
functions  to  be  resumed  at  the  point  of  departure  without  rescanning  any 
previously  executed  code. 

4.  Scanning  Compiled  Programs 

4.1. Classification of nodes in function trees. A compiled program is a
collection of function trees. To scan a function tree is to trace through
its nodes in the appropriate order. A node in a function tree may be
(1) a pointer to a data item (a value node), (2) an index into an argument
list (a dummy-parameter node), (3) the operation code for a system function,
or (4) a pointer to a function tree. If a node is of type (3) or (4), then
it is a function reference. If the function referenced in a node requires
arguments, then an argument-branch emanates from the node. An argument-branch
points to a list of actual arguments for the function. Any element in an
argument list is a node in the function and may be of any type, (1), (2), (3),
or (4).

[Figure: a source program and its compiled function tree, with legends for
the symbols and pointers used. The tree drawing is not reproduced.]

Source Program

(LENGTH (LAMBDA (A) (COND
    ((NULL A) 0)
    (T (+ 1 (LENGTH (CDR A))))
)))

Meaning of Symbols: f, opcode or pointer to function-tree; d, pointer to
data item.

Meaning of Pointers: find value of p (true or false); find value of v and
return it as value of function; use a, b, ..., x as arguments and refer
to function f.

In  addition  to  being  classified  as  to  type,  nodes  are  also  classified 
as  to  form  (1),  (2),  or  (3).  A node  of  form  (1)  is  a terminal  node  and 
has  no  branches  emanating  from  it.  Such  a node  corresponds  to  the  final 
value  of  a function. 

A node  of  form  (2)  is  a decision  node.  Two  different  branches  emanate 
from  a decision  node,  the  t-branch  and  the  f-branch.  The  value  indicated 
at  a decision  node,  regardless  of  the  type  ((1),  (2),  (3),  or  (4))  of  the 
decision  node,  must  be  either  the  atom  true  or  the  atom  false.  A processor 
scanning  a decision  node  will  proceed  down  one  branch  or  the  other  depending 
on  the  value  at  the  node,  and  ignore  the  other  branch  altogether. 

A node of form (3) is an argument node. An argument node is a member
of a list of actual arguments for a function (see types (3) and (4)). The
value at an argument node will be used as an argument for a function.
Emanating from every argument node is a link to the next argument in the
list or a termination-link if it is the last argument in the list.

Any combination of types and forms is possible at a node. For example,
a node of form (3) and type (4) is a function reference where the value of
the function will be used as the argument for another function. A second
case is a node of form (1) and type (2). In this case, an argument which
has been passed into a function (pointed to by a dummy parameter) is being
used as the value of the function. It should be noted that this division
of types and forms prohibits the use of conditional (COND) expressions as
arguments in function references. This is not a serious restriction, however.
Programs which seem to need such a construction are better written with the
COND-clause replaced by a call to a helping function.
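
One way to picture the two independent classifications is as a record with a type field and a form field. The sketch below is illustrative only: the class and field names are ours, and payloads are shown as readable names where the compiled code would hold numeric addresses. The example tree corresponds to the body of LENGTH, `(COND ((NULL A) 0) (T (+ 1 (LENGTH (CDR A)))))`.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Optional


class NodeType(Enum):
    VALUE = 1            # pointer to a data item
    PARAMETER = 2        # index into the argument list
    SYSTEM_FUNCTION = 3  # operation code
    FUNCTION_TREE = 4    # pointer to a function tree


class NodeForm(Enum):
    TERMINAL = 1   # no branches; yields the value of the function
    DECISION = 2   # has a t-branch and an f-branch
    ARGUMENT = 3   # member of a list of actual arguments


@dataclass
class Node:
    kind: NodeType
    form: NodeForm
    payload: Any                        # address, index, or op-code
    args: Optional["Node"] = None       # argument-branch (first argument node)
    next_arg: Optional["Node"] = None   # next argument; None is the termination-link
    t_branch: Optional["Node"] = None
    f_branch: Optional["Node"] = None


def is_function_reference(node):
    return node.kind in (NodeType.SYSTEM_FUNCTION, NodeType.FUNCTION_TREE)


def argument_list(node):
    """Collect the actual arguments reached through the argument-branch."""
    found, current = [], node.args
    while current is not None:
        found.append(current)
        current = current.next_arg
    return found


def param_a():
    return Node(NodeType.PARAMETER, NodeForm.ARGUMENT, 0)


cdr_a = Node(NodeType.SYSTEM_FUNCTION, NodeForm.ARGUMENT, "CDR", args=param_a())
recurse = Node(NodeType.FUNCTION_TREE, NodeForm.ARGUMENT, "LENGTH", args=cdr_a)
one = Node(NodeType.VALUE, NodeForm.ARGUMENT, 1, next_arg=recurse)
length_tree = Node(
    NodeType.SYSTEM_FUNCTION, NodeForm.DECISION, "NULL", args=param_a(),
    t_branch=Node(NodeType.VALUE, NodeForm.TERMINAL, 0),
    f_branch=Node(NodeType.SYSTEM_FUNCTION, NodeForm.TERMINAL, "+", args=one),
)
```

The root is a decision node of type (3); its f-branch is a terminal node of type (3) whose second argument is an argument node of type (4).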


4.2 The Task Queue. Entries on the task queue are either class I entries
(function evaluations) or class II entries (decision nodes awaiting predicate
values).

A class I entry is a record with the following components:

o function-tree address or system-function op-code
o argument table address
o destination address

A class II entry is a record with the following components:

o predicate value address
o T-branch address
o F-branch address
o argument table address
o destination address

When a processor pulls a class I task from the queue, it sets its
program counter (PC) to the indicated function-tree address or op-code
and sets its argument table address register and destination address
register to the values in the task queue entry. Then it goes into
function-scan mode.

When a processor pulls a task of class II off the queue, it examines
the contents of the indicated predicate value address. If the predicate
value is undefined, the processor puts the class II task back on the queue
and releases. If the predicate value is defined, the processor sets its PC
to the node indicated in either the T-branch or F-branch of the task (depending
on the predicate value, of course). It also sets its destination address
register and argument table address register to the values indicated in the
task. Then the processor goes into function-scan mode.
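
The handling of the two entry classes can be sketched as below. The record and function names, the `UNDEFINED` sentinel, and the modelling of central memory as a dictionary are assumptions made for illustration; the returned triple stands for the PC, argument table address register, and destination address register.

```python
from collections import deque, namedtuple

ClassI = namedtuple("ClassI", "code arg_table destination")
ClassII = namedtuple("ClassII", "predicate t_branch f_branch arg_table destination")

UNDEFINED = object()   # marks a memory word whose value is not yet computed


def pull_task(queue, memory):
    """Take one task; return (pc, arg_table, destination), or None on release."""
    task = queue.popleft()
    if isinstance(task, ClassI):
        return task.code, task.arg_table, task.destination

    value = memory[task.predicate]
    if value is UNDEFINED:
        queue.append(task)     # predicate not ready: requeue and release
        return None
    pc = task.t_branch if value else task.f_branch
    return pc, task.arg_table, task.destination
```

A processor that receives a triple would load its registers and enter function-scan mode; one that receives `None` returns to the roster of available processors.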

4.3 Function Scan Mode. The following rules are used by processors in
scanning functions.

4.3.1 The value of a node may be obtained directly if one of the following
conditions holds:

(a) The node is of type 1 (data value).

(b) The node is of type 2 (dummy parameter) and the memory word pointed
to indirectly through the argument table has been defined (again,
the value of the node is the value of the word).

4.3.2  If  a node  of  form  1 (terminal)  is  encountered  and  its  value  may  be 
obtained  directly,  place  a pointer  to  its  value  in  the  memory  word  indicated 
for  the  result  when  the  function  reference  was  pulled  off  the  global  task 
queue  and  release. 

4.3.3  If  a node  of  form  1 (terminal)  and  type  2 (dummy  parameter)  is 
encountered  and  its  value  cannot  be  obtained  directly,  the  processor  sets 
the  result  destination  to  an  indirect  address  pointing  to  the  address 
indicated  by  the  node  and  releases. 

4.3.4  If  a node  is  of  form  1 (terminal)  and  type  3 (system  function)  or 
type  4 (user  functions),  its  value  cannot  be  obtained  directly  and  the 
processor  goes  into  argument  scan  mode  as  described  below.  Note  that  the 
destination address for the value of the node is currently set to the
destination address of the function being scanned.

4.3.5 If a node is of form 2 (decision node) and type 3 (system function)
or type 4 (user function), the processor will generate a class II task to
put on the global task queue. This involves grabbing a word of memory
(call it P) for the predicate value part of the entry. Then get the pointers
for the two branches of the decision node (T and F). Finally retrieve the
original destination address and the argument list from registers. Once this
is done, this class II task is put on the task queue. Now start to create
a class I task with destination-address register set to P. Obtain the op-code
or function address. Then go into argument scan mode to create the argument
list.

4.4.  Argument  Scan  Mode.  The  following  rules  are  followed  by  processors 
in  argument  scan  mode.  In  this  mode  the  current  node  must  be  of  type  3 
(system  functions)  or  type  4 (user  functions). 

4.4.1  Save  the  current  position  by  storing  the  current  destination  address 
and  the  address  of  the  current  node  on  the  processor's  local  queue. 

4.4.2 Scan the nodes in the argument list of the current node in sequence.

(a) For a node of type 1 (data value) or type 2 (dummy
parameter), obtain its address and place it in the argument table being
built in local memory.

(b)  For  a node  of  type  3 (system  function)  or  type  4 (user  function), 
get  a word  of  main  memory  (its  value  will  be  currently  undefined)  and  place 
its  address  in  the  argument  table  being  built.  The  word  is  where  the  function's 
answer  will  be  stored  later,  so  the  argument  table  entry  will  eventually 
point  to  an  answer.  In  addition,  put  an  entry  on  the  local  queue  consisting 
of  a pointer  to  this  node  and  the  destination  address. 

4.4.3 Back up and evaluate the function by removing the first entry from the
local queue. If it points to a node of type (4) (function-tree), copy the
pointer to the function-tree from the node and put it on the (global) task
queue along with the current destination address and a pointer to the argument
table just prepared. On the other hand, if the entry just removed from
the local queue points to a node of type (3) (system function), try to
execute the system function. If the execution is successful, place its value
in the current destination address. If it isn't successful, put its op-code,
the argument table, and the destination address on the (global) task queue.

4.4.4 Evaluate the functions encountered in the argument scan. If the
local queue is empty, release. Otherwise, remove the first element.
That is, set the PC to the node address in this element, and set the
destination address register to the destination address in this element.
Then proceed from step 4.4.1 of these argument-scanning instructions.
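
Rule 4.4.2 is the heart of argument scanning, and a rough sketch of it follows. It covers only that rule, under assumptions of our own: argument nodes are given as `(type, payload)` pairs, and `allocate` stands for pulling a fresh, still undefined word from the processor's free-space list.

```python
def scan_arguments(arg_nodes, allocate):
    """Build an argument table and the local-queue entries for one argument list.

    arg_nodes -- list of (node_type, payload) pairs, node_type in 1..4
    allocate  -- callable returning the address of a fresh memory word

    Returns (argument_table, local_queue_entries).
    """
    table, pending = [], []
    for node_type, payload in arg_nodes:
        if node_type in (1, 2):
            # Rule (a): data value or dummy parameter, address is known now.
            table.append(payload)
        else:
            # Rule (b): nested function; reserve a word for its future answer
            # and remember the node so it can be evaluated later.
            destination = allocate()
            table.append(destination)
            pending.append((payload, destination))
    return table, pending
```

The argument table is complete as soon as the scan ends, even though some of the words it points to are filled in later by other processors.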

5.  Conclusion 

We  have  discussed  a way  to  organize  a large  number  of  processors 
into one computer system. The system can be viewed as a single-program
multiprocessor network with shared memory and a multi-instruction,
multi-data (MIMD) stream. The system is easily expandable: new processors can
be added (or deleted) without modifying the overall structure. Scheduling
is  relatively  simple  because  programs  for  the  computer  system  consist 
of  a number  of  asynchronous  subtasks.  Our  immediate  goals  in  future 
work  are  to  develop  a simulator  for  the  system  and  measure  speedups 
attainable  as  a function  of  the  number  of  processors  available  and  the 
tasks  being  performed.  Secondary  goals  include  studying  the  use  of 
memory  by  the  processors  as  they  perform  tasks  in  an  attempt  to  discover 
a way  to  further  reduce  the  complexity  of  the  memory  structure  without 
significantly  degrading  performance. 


Abstract


This paper reports investigations on new ways to use a ring network of
microcomputers  in  which  the  constituent  computers  are  loosely  coupled.  A 
single  program  is  to  be  partitioned  and  run  as  parallel  distributed  tasks  on 
the  network.  The  design  considerations,  hardware  configuration,  software 
facilities,  and  communication  protocol  are  described.  Some  processes,  such 
as  compilation,  simulation,  and  process  control,  which  are  traditionally 
performed  on  uniprocessors  are  considered  for  the  network  computer. 


I. Overview

This paper presents the design considerations for TECHNEC, the Illinois
Institute of TECHnology NEtwork Computer, outlines its organization, and
describes two of the initial ways in which it will be used. We call it a
"network computer", rather than a "computer network", because we are not
thinking of running simultaneously a number of jobs that share network
resources but rather a single job at a time, which is distributed over a
collection of microprocessors.

The  paper  is  organized  as  follows: 

II.  Design  Decisions 

III.  Network  Architecture 

IV.  Software  Facilities 

V.  Application  Software 

VI.  Summary 


II.  Design  Decisions 


There is a general sentiment that it is feasible and cost effective to
construct powerful computer systems with multiple microprocessors [1,2,4].
However, a number of hardware and software issues concerning interconnection
strategy, communication, resource allocation, and performance are still
unclear [3,5]. We shall address some of these design problems in this section
before the TECHNEC is described.

We shall use the term "interconnected computer systems" to designate
computer structures composed of multiple processors, memories, and I/O
devices. Interconnected computer systems thus include multiprocessor
systems, in which each processor may access all memories and I/O devices,
and computer networks, in which each computer accesses only its own private
memory and communicates with other computers with messages.

The construction of an interconnected computer system is a series of
design decisions. Our considerations for the TECHNEC are presented below.

1.  Mode  of  Use 

Before designing a system one must consider a strategy to exploit the
system, for this will affect many of the structural features of the system.
Three strategies for exploiting multiprocessor systems and networks are [5]:

a) Single Special Task Environment. The system is designed for a
specific task. An outstanding example is the PLURIBUS IMP, which
is connected to service one single special function, namely message
switching for the ARPA net. Since the application is specific, the
hardware and software of the system can be structured optimally.

b) Standard User Environment. Multiprocessor systems such as HIS 6180,
IBM 370/168 MP, CDC 6600, and Burroughs 6700 provide a user
environment quite similar to that of single processor systems. The
only difference is that they achieve higher throughput, reliability,
computing power, and economies of scale.

Computer networks such as the DCS and the SPIDER enable
resource sharing, so that a user at one computer may request
services available on other computers in the network. Programs may
be relocated by the network dynamically from processor to processor
to achieve balanced use of resources. To the user, the computer
network appears to be a single facility with a wider range and
selection of services than a minicomputer.

c) Multiple Specialized Application Systems. In this strategy, a
number of specialized systems are realized on an interconnected
computer system. Two examples are the C.mmp and Computer
Modules [3]. Applications such as speech analysis and visual
recognition may be run simultaneously or singly on the C.mmp,
each application occupying one or more processors. The architecture
and communication path are fixed, although the design is modular
enough that configurations may vary in the number of processors, the
size of the switch, and memory size.

The designers of the Computer Modules went further to permit
the sophisticated user to reorganize the hardware structure and
the communication pattern for a specific task.

The TECHNEC project aims to develop multiple specialized
applications on an interconnected computer system with a fixed
architecture as in C.mmp. The system is dedicated to the parallel
execution of a single application at a time, such as continuous as
well as discrete simulations and complex process control, tasks
which are traditionally executed on single processor systems.

2.  System  Characteristics 

The  interconnected  computer  system  should  have  the  following  system 
characteristics: 

a) Modularity. The capacity of the system should be incrementally
changeable. There are two measures of modularity: cost modularity
and place modularity. A system is cost modular if the incremental
cost of adding an element such as a processor is simply the cost of
the element. Place modularity refers to the amount of freedom with
which an incremental element can be inserted in the structure. The
interconnected computer system should be highly modular with respect
to these two measures.

b)  Logical  Complexity.  Since  our  effort  is  mainly  exploratory,  the 
hardware,  software,  and  communication  pattern  should  be  logically 
simple  to  comprehend  and  implement. 

c)  Distributed  Capabilities.  The  architecture  should  lend  itself 
easily  to  distributed  processing.  A single  computational  task  is  to 
be  partitioned  to  run  in  a distributed  fashion  on  the  interconnected 
computer  system. 

d) Failure Effect and Failure Reconfigurability. The effect of
failure  of  a processor  or  a communication  path  on  the  whole  system 
should  be  minimal.  It  should  be  easy  to  reconfigure  the  system 
after  failure.  One  approach  is  to  have  interchangeable  components 
for  ease  of  maintenance. 

3.  Interconnection 

The taxonomy of interconnection by Anderson [12] is instrumental in
formulating  design  problems  for  the  TECHNEC.  Ten  interconnection  designs 
were  identified  in  the  taxonomy.  The  first  choice  in  interconnection 
strategy  is  between  direct  and  indirect  transmission  of  messages  from  source 
to  destination.  In  direct  communication,  a path  in  the  form  of  a link,  a bus, 
or  a memory  connects  two  processors.  In  indirect  communication,  an  intervener 
exists  to  alter  the  messages  (e.g.,  address  transformation)  or  to  route  the 
message  onto  one  of  a number  of  alternative  output  paths.  For  example,  in 
the  SPIDER,  which  is  a loop  network  with  a central  switch,  two  processors 
intending  to  engage  in  communication  must  first  report  to  the  central  switch. 
Each  message  is  sent  to  the  central  switch  with  the  sender's  identification. 
The  central  switch  then  routes  the  message  to  the  receiver  with  the  receiver's 

Other examples of interveners in indirect communication are the central
switching node in a star network and the memory mapping functions across
buses in Computer Modules, which are interconnected bus systems. Accessing
a memory on a distant bus in Computer Modules takes variable time which
depends on the number of intermediate mappings. Moreover, interconnected
bus systems are susceptible to deadlocks [13].

Because of the relative complexity of indirect communication, we have
decided to make use of direct communication.

The  second  decision  to  make  with  respect  to  interconnection  involves  the 
choice  of  shared  or  dedicated  message  transfer  paths.  A path  is  considered 
to  be  shared  if  it  is  accessible  from  more  than  two  points  at  the  same  time. 

Two  techniques  provide  a shared  transfer  path:  global  bus  or 
multiprocessor.  Access  to  the  global  bus  is  shared  among  the  processors  by 
some  arbitration  scheme.  Messages  are  sent  from  the  source  processor  onto 
the  bus,  to  be  recognized  and  accepted  by  the  proper  destination.  It  is 
simple  to  add  additional  processors,  but  cost  modularity  and  place 
modularity  of  the  bus  are  poor.  It  is  impossible  to  increase  the  bandwidth, 
and  reduction  of  contention  often  necessitates  multiple  buses  or  bus 
replacement.  Failure  of  individual  processors  does  not  cause  serious  impact, 
because  multiple  processors  are  available.  It  is  simple  to  amputate  faulty 
processors  but  bus  failure  is  disastrous. 

In multiprocessor architectures, two or more processors share a common
physical memory space. As a side effect, this common memory is used for
message communication. The simplest common memory access mechanism is a bus.
Implementations  more  commonly  use  multiple  memory  buses  or  very  expensive 
central  switches.  It  has  been  found  that  the  performance  of  multiprocessor 
systems  increased  more  slowly  as  the  number  of  processors  increased  because 
of  contention  for  memory  bandwidth.  It  appears  that  effective  use  of  such 
systems  requires  fundamental  understanding  of  program  partitioning.  The 
cost  of  the  central  switch  is  also  out  of  our  reach. 

Two implementation alternatives are available in the dedicated path
approach: complete interconnection and the ring. In the complete
interconnection approach, each processor is connected directly to every other
processor,  and  messages  are  exchanged  between  any  pair  of  processors  along 
the  dedicated  link.  This  organization  has  the  advantages  of  logical 
simplicity  and  the  capability  of  accommodating  any  communication  pattern 
among  processors.  Cost-modularity,  however,  is  very  poor.  Addition  of  a 
processor  to  an  n-processor  system  requires  the  addition  of  n paths,  and 
all  the  existing  processors  must  be  made  ready  to  handle  incoming  messages 
from  the  incremental  processor. 
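
The cost argument can be made concrete with a little arithmetic. The sketch below counts dedicated paths only; the function names are ours, and the ring figure assumes one path between each pair of adjacent processors with at least three processors in the ring.

```python
def complete_paths(n):
    """Paths needed when every processor is linked to every other one."""
    return n * (n - 1) // 2


def ring_paths(n):
    """Paths in a ring of n >= 3 processors (one per adjacent pair)."""
    return n


def paths_added(count, n):
    """Extra paths needed to grow a system from n to n + 1 processors."""
    return count(n + 1) - count(n)
```

Growing a completely interconnected system from 5 to 6 processors adds `paths_added(complete_paths, 5)` = 5 paths, in line with the n paths stated above, whereas a ring grows by a single path regardless of its size.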

We  are  left  with  one  candidate  structure:  the  ring,  which  provides  a 
single  direct,  dedicated  communication  path  between  adjacent  pairs  of 
processors. Our reasons for adopting the ring structure are as follows:

a)  Modularity.  The  cost-modularity  and  place-modularity  of  the  ring 
are  very  good.  The  cost  of  adding  a node  is  the  cost  of  a processor 
and  the  associated  portion  of  communication  path. 

b)  Logical  Complexity.  The  logical  complexity  of  the  ring  is  very  low. 
Each  node  communicates  with  only  its  predecessor  and  its  successor. 
The hardware and software for each node can be made completely
identical.

c)  Distributed  Processing.  Although  programs  are  relocatable  across 
processors  in  the  DCS  or  SPIDER,  the  execution  of  a program  is 
essentially  localized  at  one  specific  location.  Distributed 
processing  is  straightforward  in  multiprocessor  systems  via  the 
common  memory,  but  protection  and  system  integrity  are  serious 
problems. 

The  concept  of  a network  computer  intrigues  us,  namely  to  run 
a program  in  a distributed  manner  on  a network  of  loosely  coupled 
computers,  which  behaves  as  a powerful  computer  system.  The  loose 
coupling  isolates  portions  of  a program  and  enhances  system 
integrity.  This  concept  may  be  advantageously  employed  in  future 
systems in which more intelligence is added to peripheral
processors. The price paid is the relatively slow communication rate
between processors. Partitioning of program and data becomes an
important problem in using such a system effectively.

The  ring  provides  a straightforward  environment  for 
investigations  on  pipelined  processing. 

d)  Failure  Effect  and  Failure  Reconfigurability.  Failure  effect  of  a 
processor  is  minimal,  as  in  the  global  bus  approach,  but  the  failure 
effect  of  the  ring  communication  path  is  not  as  bad  because  the 
communication  path  can  be  implemented  with  microprocessors.  The 
probability  of  failure  of  all  the  processors  is  rather  small. 

Traditionally, reconfiguration after failure has not been
performed on ring networks such as DCS and SPIDER, which are
constructed to link geographically dispersed computer systems for
file and resource sharing. But in our intended use, the ring can be
reconfigured easily to a smaller ring by removing failed processors
or communication microprocessors.

e) Performance Improvement. Traditional ring networks use unidirectional,
bit-serial interfaces. Byte-parallel interfaces are available for
implementation of the communication path with microprocessors. Moreover,
the bandwidth can be improved by refinement of the message protocol and
message buffering at communication processors. Bidirectional message
communication may also be attempted.

4.  Communication 

The  design  of  a message  protocol  for  a ring  network  needs  to  consider  the 
following  factors:  direction  of  communication,  number  of  receivers  per  message, 
message  length,  coupling  between  sender  and  receiver,  and  message  buffering. 

Unidirectional  communication  was  chosen  because  of  the  complexity  of 
bidirectional communication.

One-to-many (broadcast) mode is desirable as a control mechanism to
synchronize parallel processes in applications such as simulation and
demon control [15]. Point-to-point communication is the basic mode of
correspondence between processes.

The message size often depends on the nature of the application.
Variable message size is appropriate: fixed-size packets are wasteful
for short messages, which are the majority, and they require an elaborate
disassembly-assembly mechanism to handle messages longer than a packet.

The DCS identifies processes by unique names. A message is addressed
to a process, typically a compiler or an editor, by name rather than location
because processes are moved dynamically within the network to achieve
balanced loading. A requesting process does not and should not know the
location of the receiver process. In the TECHNEC, communication
overhead is incurred, as a result of program partitioning, by
allocating interdependent processes to separate computers. The
communication pattern is predetermined before execution. The processes are
nameless but some form of identification is essential to identify the logical
communication path. Virtual channels are assigned for each logical
communication path to delay process-processor binding. Moreover, the concept
of virtual channels extends easily to the broadcast mode.

The TECHNEC also differs from the DCS in message buffering. In the
latter, outgoing messages are queued in an output queue waiting to get on the
ring. A message, on arriving at the destination processor, is entered into
the input queue of the receiver process. On the TECHNEC, a sender must have
enough space for all the outgoing messages and must keep a copy of a message
until the message has made a round trip around the ring. This policy
effectively prevents a process from using up all the free space at a receiver.

5.  Deadlocks 

Deadlocks can occur when multiple messages, each occupying a part of the
communication path, attempt to access the part occupied by another message [13].
The TECHNEC is basically a Newhall loop [14] that permits at most one message to
be transmitted at a time, so this kind of deadlock cannot arise.


6.  Resource  Allocation 


The  major  resources  of  the  network  are  processors  and  local  memories. 
Since each processor accesses exclusively its local memory, the two resources
are one. The resource allocation problem becomes one of program partitioning
so that the overall execution time of the program is minimal.

7.  Process  to  Processor  Binding 

A process will be bound to a processor and physical memory to simplify
network software and memory management. A scheduler is necessary for
multiple processes coexisting within the same processor. The use of
virtual channels instead of processor identification and locations delays the
binding and facilitates processor assignment.


III. Network Architectures


As was mentioned earlier, TECHNEC is a ring of 12 nodes (Fig. 1). A node
is composed of a Ring Interface Unit (RIU) and a Microprocessing Unit (MPU).
The RIUs, linked by I/O ports, are responsible for message communication
between nodes.

Both the MPU and the RIU are implemented by microcomputers for economic
reasons. Also, one of our main objectives in this project is to investigate
ways of utilizing a network of inexpensive microcomputers to perform
applications such as the examples sketched in Section V.

Each MPU is implemented by an LSI-11 with at least 12K words (up to
56K bytes) of RAM and floating-point hardware. The LSI-11 was chosen mainly
because of the considerable amount of system software support accessible to
LSI-11 users and the availability of floating-point instructions, which are
essential to our applications. Moreover, the LSI-11 assembly language is
powerful enough for software development. Another factor is our familiarity
with PDP-11 assembly language and its general structure. This relative ease
of software development allows the research group to concentrate on
investigations in distributed processing.

Each RIU is implemented by an 8-bit RCA COSMAC with 1K bytes of RAM.
Each RIU provides two functions. It serves as an interface between the MPU
and the ring, and also communicates with adjacent RIUs. The RCA COSMAC was
chosen to implement the RIU because it is a "low end" device suitable for
simple applications with limited programming needs. Being fabricated with
CMOS technology, it requires very little power. It operates with a single
power supply (between +3V and +12V) and is very insensitive to noise. Four
flags, which are connected directly to CPU pins, can be set or reset by
external logic to control actions of the COSMAC CPU. For example, the EF3
flag can be controlled by the MPU, and one COSMAC instruction is needed to
branch on the high or low condition of the flag. The COSMAC also provides a
DMA facility suitable for easy, efficient, and automatic program loading and
data transfers. The COSMAC instruction repertoire, although
limited, is sufficient for implementing message protocols. Moreover, since
the RIU is transparent to the TECHNEC user, powerful programming facilities
are not required. These factors made the COSMAC very attractive as a
candidate for RIU implementation.


An RIU communicates over an 8-bit data bus to which two interfaces are
connected: the RIU - RIU interface and the RIU - MPU interface (Fig. 2).

RIU - RIU Interface

Communication between two adjacent RIUs is strictly asynchronous,
byte-parallel, and unidirectional. RIU(n+1) and RIU(n) always have a master-slave
relation, communicating via an I/O port that is realized by an INTEL-8212
chip. An I/O port is implemented as a memory-mapped device in the RIU's
address space. The EF1 flag of RIU(n+1) and the EF2 flag of RIU(n) (Fig. 3)
are connected to the INT output of the 8212 chip. In the quiescent state,
both flags are set high. Each RIU continually polls its input port for data
arrival from the previous RIU. When RIU(n) writes into its output port, the
EF1 flag of RIU(n+1) and the EF2 flag of RIU(n) are pulled low. The EF2
flag remains low until RIU(n+1) reads from its input port. This interlock
prevents RIU(n) from writing further data into its output port. The EF1
flag at the low condition indicates to RIU(n+1) that it may read. A read
operation by RIU(n+1) from its input port removes the byte and sets both
flags high, enabling RIU(n) to write again.
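
The interlock described above can be illustrated with a small Python model. This is only a sketch under stated assumptions: the class name Port8212, its methods, and the transfer helper are hypothetical, the port is reduced to a one-byte latch, and True stands for a flag at the high condition. It is not the COSMAC software itself.

```python
class Port8212:
    """Toy model of the one-byte port between RIU(n) and RIU(n+1) (hypothetical)."""

    def __init__(self):
        self.byte = None  # latched byte; None means the port is empty

    # Both flags follow the same INT line: high (True) while the latch is empty.
    @property
    def ef1(self):
        return self.byte is None  # flag seen by the reader, RIU(n+1)

    @property
    def ef2(self):
        return self.byte is None  # flag seen by the writer, RIU(n)

    def write(self, value):
        """RIU(n) writes a byte; refused while EF2 is low (byte not yet read)."""
        if not self.ef2:
            return False
        self.byte = value & 0xFF  # both flags are now low
        return True

    def read(self):
        """RIU(n+1) polls the port; returns None while EF1 is high."""
        if self.ef1:
            return None
        value, self.byte = self.byte, None  # removing the byte sets both flags high
        return value


def transfer(data):
    """Move a byte string across the port one byte at a time."""
    port, pending, received = Port8212(), list(data), []
    while pending or not port.ef1:
        if pending and port.write(pending[0]):
            pending.pop(0)
        value = port.read()
        if value is not None:
            received.append(value)
    return bytes(received)
```

In this model a second write before the intervening read is simply refused, which is the role the EF2 flag plays for RIU(n).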

RIU - MPU Interface

There are two types of interfaces between an RIU and the corresponding
MPU:

a) Register Interface. This is a normal PDP-11 interface. There is a
Control and Status Register (CSR) and a Data Buffer Register (DBR).
An RIU and its corresponding MPU can communicate by setting control
and status bits (operation code) in the CSR. The DBR is often used
for passing addresses between MPU and RIU.

b) Direct Memory Access (DMA) Interface. This interface is controlled
by the RIU. The appropriate address to be read (written) from (to)
is first entered in a special register at addresses 140000 and
140001 (all addresses octal). Another pair of registers (140004, 140005
for input and 140002, 140003 for output) is reserved for read and write DMA
operations. These addresses are assigned because we use the most
significant bit of the 16-bit COSMAC address for memory-mapped I/O
operations, and the second most significant bit for DMA operations.
The addresses are decoded by a 74LS138 chip. An RIU initiates a
DMA operation by writing a special operation code (2 for interrupt,
4 for DMA-write, and 5 for DMA-read) to the memory-mapped device
address 140007. This operation code serves as a DMA strobe signal.
Once a DMA operation is initiated, a custom-made diode matrix
utilizes the DMA facility on the LSI-11 and performs the required
"handshaking" with the LSI-11 bus for cycle stealing. The COSMAC
software is responsible for repeating the process until a complete
message is transferred. All message transfers between RIU and MPU
are performed via the DMA mode, once it is initiated using the
register interface logic.

Figure 3. COSMAC-COSMAC Interface
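
The address assignment can be checked with a few lines of Python. The sketch below is illustrative only: the names MMIO_BIT, DMA_BIT, REGISTERS and decode are hypothetical, and it assumes that the three low-order address bits select the register (a natural fit for a 3-to-8 decoder such as the 74LS138, but not stated in the text) and that all other address bits are zero.

```python
MMIO_BIT = 1 << 15  # most significant bit: memory-mapped I/O
DMA_BIT = 1 << 14   # second most significant bit: DMA operations
BASE = MMIO_BIT | DMA_BIT  # equals 140000 octal

# Register roles taken from the text; offset 6 is not mentioned there.
REGISTERS = {
    0: "DMA address register",
    1: "DMA address register",
    2: "DMA output data",
    3: "DMA output data",
    4: "DMA input data",
    5: "DMA input data",
    7: "operation code (DMA strobe)",
}

# Operation codes written to address 140007 (octal).
OPCODES = {2: "interrupt", 4: "DMA-write", 5: "DMA-read"}


def decode(address):
    """Return the register selected by a 16-bit address, or None if outside the block."""
    if address & ~0o7 != BASE:
        return None
    return REGISTERS.get(address & 0o7, "unassigned")
```

For example, decode(0o140007) selects the operation-code register, while an address with only the top bit set falls outside the DMA block and yields None.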

IV. Software Facilities

The most important software facility on TECHNEC is the message
communication routine. There is a message manager routine (MSGMAN) in
each MPU and in each RIU a software module, MSGBOY, which cooperates
with the corresponding MSGMAN in message transmission and reception. Other
facilities include multitasking dispatching, console communication, file
management and debugging. More space will be dedicated to the description
of the message communication mechanism since the other facilities are rather
conventional.

Communication 

Nodes communicate using a unidirectional protocol. A message
passes through each node and is removed from the ring when it is received
in the node which originated it.

The design of the message protocol allows the possibility of having more
than one message passing around the ring at a time; this implies a buffering
capability within each node. Owing to lack of buffering, only one message is
active on the ring at any time in our current implementation.

Messages are passed over a group of "virtual" channels. The system
currently has no dynamic channel assignment facility, so the allocation of
channels must be done by the users. Certain channels are pre-assigned for
internal system functions (such as program loading and debugging operations).
Each request to the message passing system specifies the channel number over
which the message is to be sent or received.

Two styles of messages are supported: "point-to-point" and "broadcast"
messages. The point-to-point style of message is the simpler to use. Each
message request includes a "subchannel" number in addition to the channel
number. It is assumed that only one task will actually be waiting for a
message over the channel/subchannel number combination. This style is more
frequently used by existing programs. The channel/subchannel number
combination is effectively the name of a task in a point-to-point communication.

The "broadcast" message style was included to allow the one-to-many mode of
communication. In broadcast usage, the sending task does not know what
task(s) may be waiting for a message on the channel; indeed, there may be
no tasks currently waiting. Likewise, a receiving task has no way of
knowing what other tasks also received a copy of the broadcast. Also, the
system does not provide any identification of the sender in the protocol, though
this could be included within the data portion of the message if desired.

The system uses dynamic buffer allocation for the receipt of broadcast
messages. MSGMAN allows any number of tasks waiting for a broadcast message
on a single channel to share a single copy. The buffer is not freed
until all such tasks indicate that they no longer need the message. Each
message contains a routing field that is read and reset by receivers and is
used to convey reception status information to the sender.
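
The shared broadcast buffer can be sketched as a table that frees a buffer only when its last holder releases it. The following Python fragment is a minimal illustration, not MSGMAN itself: the names ChannelTable, wait, deliver and release are hypothetical, and it assumes at most one outstanding broadcast per channel.

```python
class ChannelTable:
    """Toy bookkeeping for shared broadcast buffers (hypothetical names)."""

    def __init__(self):
        self.waiting = {}  # channel -> set of tasks waiting for a broadcast
        self.buffers = {}  # channel -> (data, set of tasks still holding it)

    def wait(self, channel, task):
        """Register a task as waiting for a broadcast on the channel."""
        self.waiting.setdefault(channel, set()).add(task)

    def deliver(self, channel, data):
        """Store one copy for all waiting tasks; return how many share it."""
        tasks = self.waiting.pop(channel, set())
        if not tasks:
            return 0  # no task is waiting, so no buffer is allocated
        self.buffers[channel] = (data, tasks)
        return len(tasks)

    def release(self, channel, task):
        """A task indicates that it no longer needs the message."""
        data, holders = self.buffers[channel]
        holders.discard(task)
        if not holders:
            del self.buffers[channel]  # last holder gone: free the buffer
```

With two tasks waiting on one channel, a single delivery allocates one buffer, and the buffer disappears only after both tasks have called release.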

The  message  passing  operations  are  performed  by  MSGBOY.  It  is 
responsible  for  receiving  messages  from  the  preceding  RIU,  interrupting  its 
MPU  to  indicate  the  arrival  of  messages,  obtaining  data  from  its  MPU  by  DMA 
mode for transmission, passing messages to the succeeding node, and other
functions  such  as  loading  code  into  its  MPU  and  dumping  the  contents  of  its 
MPU.  The  MSGBOY  is  invisible  to  the  user  task  in  the  MPU. 

User tasks in the MPU make requests for service to MSGMAN, which maintains
tables and queues indicating the state of the message passing mechanism.
It also contains an entry point which services the RIU interrupt. Whenever
the MSGBOY needs a decision regarding what to do, it interrupts the MPU,
passing a code through the status register telling what is required. MSGMAN
then examines its tables and returns a code to the RIU, informing it what to
do.


When the RIU examines the CSR and finds that it should copy the message
into the MPU, it will proceed without any further interaction with the MPU
until the end of the message is detected. At this point it will generate
another interrupt. MSGMAN will then update its tables, unblock any waiting
tasks, etc.

If the RIU is not to copy the message, it simply passes the message
through. In either case, it will appropriately update the routing field.

When the RIU detects the "caboose" of a possible train of messages, it
will generate an interrupt to inform the MPU that a message can be
transmitted. MSGMAN will interrogate its tables and inform the RIU as to
whether a message is to be sent or not. If yes, then the RIU will add the
message to the ring. When it detects the message coming in (having passed
around the ring), it removes the message (i.e., does not pass it on) and
interrupts the MPU to allow MSGMAN to again update its tables.
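
The round trip of a single message can be traced with a short simulation. This is a sketch under assumptions, not the MSGBOY code: the function name circulate and the listening table are hypothetical, and the routing field is modeled simply as a count of the nodes that copied the message, since its actual encoding is not given here.

```python
def circulate(n_nodes, origin, channel, data, listening):
    """Send one message around a unidirectional ring of n_nodes.

    listening maps a node number to the set of channels on which that node
    has a task waiting. Returns the copies delivered, the final routing
    field (here a copy count), and the number of hops travelled.
    """
    message = {"channel": channel, "data": data, "copies": 0}
    delivered = {}
    node = (origin + 1) % n_nodes  # the message leaves the originating node
    hops = 1
    while node != origin:
        if channel in listening.get(node, ()):
            delivered[node] = message["data"]  # RIU copies the message to its MPU
            message["copies"] += 1             # and updates the routing field
        node = (node + 1) % n_nodes            # in either case it is passed on
        hops += 1
    # Back at the originator: the message is removed, not passed on.
    return delivered, message["copies"], hops
```

On a 12-node ring every message makes exactly 12 hops before its originator removes it, whichever nodes copy it along the way.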

Improvements  are  being  considered  to  achieve  higher  communication 
bandwidths  by  adding  buffering  at  the  RIUs.  Multiple  messages  can  then 
be  transmitted  simultaneously.  Additional  memory  to  accommodate  the  list 
of  channel  numbers  at  each  RIU  would  also  enable  the  RIU  to  take  a more 
active role in deciding on appropriate actions on a message instead of
interrupting the MPU each time. Since the list of channel numbers may be
changed dynamically, the capability of an MPU to interrupt its RIU is also
desirable.

SEXTECH  (System  Executive) 

A program running on TECHNEC is to be structured as one or more tasks.
On each MPU there is a resident multitasking dispatcher called SEXTECH.
The SEXTECH allows multiple tasks to reside in an MPU and schedules user tasks
in a simple round robin fashion.
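
Round robin dispatching can be illustrated with Python generators standing in for tasks. The sketch assumes cooperative tasks that give up the processor at each yield; whether SEXTECH switches tasks cooperatively or by preemption is not stated, and the names run_round_robin and worker are hypothetical.

```python
from collections import deque


def worker(steps):
    """A task that needs the processor for the given number of time slices."""
    for _ in range(steps):
        yield  # give the processor back to the dispatcher


def run_round_robin(tasks):
    """Run named generator tasks in turn; return the order in which slices ran."""
    ready = deque(tasks.items())
    trace = []
    while ready:
        name, task = ready.popleft()
        try:
            next(task)  # run the task for one slice
        except StopIteration:
            continue    # task finished: drop it from the ready queue
        trace.append(name)
        ready.append((name, task))  # go to the back of the queue
    return trace
```

For instance, run_round_robin({"A": worker(2), "B": worker(1)}) interleaves the slices as A, B, A.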

SUPERTECH 


The system node provides utility functions. It interacts with the user
terminal, floppy disks, and phone lines to other computers on campus. This
node runs a collection of facilities which are jointly called SUPERTECH. One
such facility (CONMAN) allows user tasks to interact with the system terminal.
Another (FILMAN) allows user tasks to access files stored on the floppy disks.
Yet another (DEBUG) interacts with the user (using CONMAN) to allow debugging
operations to be performed over the ring. This is absolutely necessary since
the user nodes have no terminals, no display lights, etc. The only way to
examine the contents of memory in user nodes is via the DEBUG facility.
User tasks request actions by these facilities by sending messages over
special channels.

The DEBUG module provides functions such as suspension of a task,
resumption of a task, modification of the contents of a location, display of
status, and breakpoints.

V. Application Software

Several ongoing projects are developing software geared to the operation
of the network computers. These projects include pipelined compiling,
real-time control of complex processes, discrete and continuous simulations,
demon language, and a FORTRAN compiler [20-22].

A pipelined  compiler  which  produces  distributed  object  code  is  described 
in  this  section  to  illustrate  the  network  mode  of  operation  and  techniques  to 
best  utilize  the  network  computer.  A style  of  real-time  control  of  complex 
processes  appropriate  to  the  network  computer  will  also  be  presented. 

Pipelined  Compilation  and  Problem  Decomposition 

To  achieve  efficient  use  of  the  network  computer,  the  following 
strategies  should  be  attempted: 

a)  A program  must  be  partitioned  into  a number  of  parallel  tasks. 

b)  There  must  be  at  least  as  many  tasks  as  constituent  computers. 

c) The tasks may form clusters that have their own data structures,
but  between  such  clusters  tasks  must  communicate  only  by  sending 
and  receiving  messages  since  there  is  no  common  memory  among 
computers. 

d)  The  memory  requirements  of  each  cluster  must  not  exceed  the  memory 
capacity  of  the  constituent  computer  since  memory  management  may  be 
inefficient  and  difficult. 

e)  The  amount  of  data  that  must  be  exchanged  between  clusters  should 
be  minimized  to  avoid  saturating  the  communication  channels. 

f)  The  processing  requirements  among  computers  should  be  balanced  and 
overlapped  as  much  as  possible,  to  minimize  the  overall  execution 
time  of  the  program. 

These strategies can be illustrated by the compile-time and runtime
environments of a DYNAMO compiler on the TECHNEC. Each compilation
phase resides on a separate computer of TECHNEC. A source program is first
entered from a console to a file at the system node. Statements of the
source program pass through the phases in order, with no feedback required,
so we may consider the compiler to be pipelined in the same sense as
pipelined arithmetic units. The generated object code is decomposed into clusters
by a partitioning program and stored in a file at the system node. When
program execution is initiated, the clusters are loaded into the assigned
processors. The program runs in a distributed mode.

Problem  decomposition  at  compile-time  and  runtime  is  achieved  by  two 
approaches:  functional  partitioning  and  partitioning  by  data  dependency. 

The  first  approach  divides  a program  into  manageable  portions  according 
to  its  functions  with  minimum  message  communication  and  maximum  use  of  the 
available  parallelism.  The  compilation  process  is  organized  as  a pipeline. 
Three  major  constraints  are  thus  imposed  on  the  design.  First,  to  keep  the 
pipeline  busy,  each  phase  processes  one  statement  at  a time.  Once  the 
statement  is  processed,  the  statement  in  its  new  converted  form  is  passed 
to  the  next  phase.  Neither  the  input  form  nor  the  converted  form  will  be 
available  to  the  phase  for  future  processing.  Second,  since  the  nodes  of 
TECHNEC  do  not  share  memory,  individual  phases  cannot  access  global  tables. 
Information  derived  by  each  phase  should  be  imbedded  in  the  internal  code 
which  is  routed  to  succeeding  phases.  Third,  although  the  total  memory  of 
the  network  is  144K  words,  each  phase  can  access  only  its  12K  words  of  local 
memory. 
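
The three constraints can be illustrated by chaining Python generators, each standing in for a phase on its own node. This is only an analogy under stated assumptions: the phase names scanner and symbol_phase, the token format, and the toy statement syntax are hypothetical. Each phase handles one statement at a time, keeps its table local, and embeds what it has learned in the form it passes on.

```python
def scanner(lines):
    """First phase: turn each source line into a list of tokens."""
    for line in lines:
        yield line.replace("=", " = ").replace("+", " + ").split()


def symbol_phase(statements):
    """Later phase: replace identifiers by indices into a local symbol table."""
    table = {}  # local to this phase; no other phase can read it
    for tokens in statements:
        converted = []
        for tok in tokens:
            if tok.isidentifier():
                # The index travels with the statement, so succeeding phases
                # never need to consult this table.
                converted.append(("id", table.setdefault(tok, len(table))))
            else:
                converted.append(("sym", tok))
        yield converted  # pass the converted form on and keep no copy


def compile_pipeline(lines):
    """Connect the phases; statements flow through in order with no feedback."""
    return list(symbol_phase(scanner(lines)))
```

Because generators are lazy, the second phase starts on a statement as soon as the first phase releases it, which mirrors the statement-at-a-time flow described above.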

The  second  approach  partitions  a program  into  parallel  clusters 
according to data dependencies between tasks of the program, but the
functional  properties  of  the  program  are  not  taken  into  consideration. 
Execution  of  clusters  in  parallel  reduces  execution  time,  but  message  passing 
and  a potential  wait  are  necessitated  when  two  dependent  tasks  are  assigned 
to  different  clusters.  For  example,  a compile-time  module  of  the  DYNAMO 
compiler  partitions  the  object  code  into  clusters  so  that  the  overall 
execution  time  of  the  program  is  minimum. 

A brief introduction to DYNAMO will be given before the pipelined
compiler is described. DYNAMO, as a continuous simulation language, models
a system as a set of variables called LEVELS and their rates of change called
RATES. The changes of LEVELS and RATES with respect to time are expressed as
a set of difference equations. Another class of variables, AUXILIARYs, is
used to help specify complex relationships between LEVEL and RATE variables.
A fourth class of variables, SUPPLEMENTARYs, is used solely in output
statements.

DYNAMO is nonsequential; i.e., statements can be written in any order
without affecting the outcome of the program. The general pattern of
execution of one cycle of simulation is first LEVELS, then AUXILIARYs and RATES.
At certain specified time points, SUPPLEMENTARY variables are calculated.
LEVEL statements are independent of one another and can be executed in any
order; the same is true of RATE equations when they are later executed.
AUXILIARY variables, however, are usually interdependent.
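
The order of evaluation within one cycle can be made concrete with a tiny model written directly in Python rather than in DYNAMO. Everything specific here is an assumption chosen for illustration: the single level, the auxiliary named crowding, the constants, and the function name simulate.

```python
def simulate(cycles, dt=1.0):
    """Run a one-level model and return the level at each cycle."""
    level = 100.0  # LEVEL: the accumulated quantity
    rate = 0.0     # RATE: flow computed during the previous cycle
    history = []
    for _ in range(cycles):
        # 1. LEVELS: difference equation using the old level and the old rate.
        level = level + dt * rate
        # 2. AUXILIARYs: intermediate quantities derived from the new level.
        crowding = level / 200.0
        # 3. RATES: flows for the coming interval, from levels and auxiliaries.
        rate = 0.5 * level * (1.0 - crowding)
        history.append(level)
    return history
```

Three cycles of this model give levels of 100.0, 125.0 and 148.4375. With several levels, step 1 could update them in any order because each uses only values from the previous cycle.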

The compiler consists of eight phases, as shown in Fig. 4. The first
four stages are quite conventional. The Scanner converts input text into
internal tokens and the Macro Expansion Module expands all macro calls to
system or user-defined macros into internal token strings. The input text,
together with errors discovered by the Scanner and Macro Expansion Module, is
passed to a file for listing. The statement in token form is then passed to
the next phase, Symbol Table Manipulation, as a message. The Symbol Table
Manipulation Routine converts each identifier token into a unique index into
the Symbol Table. Since statements are nonsequential, an identifier may be
used in the right hand side (RHS) of a statement before it appears on the
left hand side (LHS) of a statement. A data structure showing the
dependency among statements is necessary to recognize undefined identifiers
and doubly defined identifiers. The Parser converts each internal token
string into Polish suffix form using a transition matrix.
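
For readers unfamiliar with Polish suffix form, the conversion can be sketched as follows. Note that this sketch uses the familiar operator-precedence (shunting-yard) method with an explicit stack, not the transition matrix used by the Parser described here; the function name to_suffix and the operator set are assumptions, and the input is assumed to be well formed.

```python
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}


def to_suffix(tokens):
    """Convert a list of infix tokens to Polish suffix (postfix) order."""
    output, stack = [], []
    for tok in tokens:
        if tok in PRECEDENCE:
            # Emit stacked operators that bind at least as tightly (left-associative).
            while (stack and stack[-1] in PRECEDENCE
                   and PRECEDENCE[stack[-1]] >= PRECEDENCE[tok]):
                output.append(stack.pop())
            stack.append(tok)
        elif tok == "(":
            stack.append(tok)
        elif tok == ")":
            while stack[-1] != "(":
                output.append(stack.pop())
            stack.pop()  # discard the matching "("
        else:
            output.append(tok)  # operand: identifier or constant
    while stack:
        output.append(stack.pop())
    return output
```

The token string A + B * C becomes A B C * +, and ( A + B ) * C becomes A B + C *.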

Since DYNAMO is nonsequential, the Sequencing Module determines the
order of execution of statements and initialization equations of some identifiers.
The sequencing order is determined by building successor lists, one for each
identifier. Thus, an identifier on the LHS of a statement is a successor of
each of the identifiers on the RHS, since the values of the RHS identifiers
must be known before the LHS identifier can be calculated. When all the
statements have been encountered, a topological sort program is run. The output of
the topological sort program is the sequencing order.
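
The successor lists and the topological sort can be sketched in a few lines. This is one standard way to do it (Kahn's algorithm), offered as an illustration rather than as the Sequencing Module's actual program; the function name sequencing_order and the (LHS, RHS list) statement format are assumptions.

```python
from collections import deque


def sequencing_order(statements):
    """Order identifiers so that every RHS identifier precedes its LHS.

    statements is a list of (lhs, rhs_identifiers) pairs in any order.
    """
    successors = {}  # identifier -> identifiers whose statements use it
    unmet = {}       # identifier -> number of RHS values not yet known
    for lhs, rhs in statements:
        successors.setdefault(lhs, [])
        unmet.setdefault(lhs, 0)
        for ident in rhs:
            successors.setdefault(ident, []).append(lhs)
            unmet.setdefault(ident, 0)
            unmet[lhs] += 1
    # Start with identifiers that depend on nothing.
    ready = deque(sorted(name for name, count in unmet.items() if count == 0))
    order = []
    while ready:
        name = ready.popleft()
        order.append(name)
        for succ in successors[name]:
            unmet[succ] -= 1
            if unmet[succ] == 0:
                ready.append(succ)
    if len(order) != len(unmet):
        raise ValueError("circular dependency among statements")
    return order
```

Given the statements C = f(A, B), B = g(A) and A = constant, written in that order, the function returns A, B, C. A circular definition is reported as an error instead of producing an order.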

Figure 4. Structure of the Pipelined DYNAMO Compiler.

The Sequencing Module can determine the sequencing order only after
all statements are encountered. But both its predecessor, the Parser, and
its successor, the Code Generator, function on a statement-by-statement
basis. To utilize the full capacity of the pipeline, the Sequencing Module
passes the Polish suffix string to the Code Generator after extracting
dependency information from it. The Code Generator transforms the Polish
suffix string to a closed subroutine. The sequencing order, when supplied
by the Sequencing Module, will be coded as subroutine calls.

In  addition  to  generating  code,  the  Code  Generator  reports  memory 
requirement  and  execution  time  of  each  statement  to  the  Partitioning  Module. 

The Partitioning Module aims to divide the object code into clusters of
subroutines together with the associated calls. A mixed integer linear
programming model produces optimal partitioning for all program structures.
The model assumes that every task (statement) is assigned to one and only one
processor, and that communication time, if any, and execution time for each
statement are known. No cluster exceeds the available main memory
in size. The objective is to obtain a partition that will produce the
minimum total execution time for each simulation cycle. Other heuristics,
based on the data dependency partitioning approach, are being tested against
this optimal, but expensive, model.

One heuristic is to first assign a LEVEL quantity to a cluster of its
own and gather its predecessors to the same cluster. If a quantity Q is the
predecessor of more than one LEVEL, there are the options of duplicating Q
and all predecessors of Q for each LEVEL, or assigning Q to one of the LEVELS
and providing communication primitives to send the value of Q to others. In
the second option, the "strength" of the dependency determines the assignment.
The output of this phase will be statement subroutines, their calls, a list
of processor assignments and appropriate communication commands. The latter
two items of information are used by the loader for allocation of object
code to processors.
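
The first step of this heuristic, with the duplication option, can be sketched as follows. The sketch makes several assumptions that are not in the text: the function name clusters_by_level, the predecessor table format, and the rule that gathering stops at another LEVEL (whose value would arrive by message). It does not attempt the "strength"-based assignment of the second option; it only reports which quantities are shared.

```python
def clusters_by_level(levels, predecessors):
    """Give each LEVEL its own cluster and gather its predecessors into it.

    levels is a set of LEVEL names; predecessors maps a quantity to the
    quantities it is computed from. Returns (clusters, shared), where shared
    holds the quantities that were duplicated into more than one cluster.
    """
    clusters = {}
    for level in levels:
        members = set()
        stack = list(predecessors.get(level, ()))
        while stack:
            q = stack.pop()
            if q in members or q in levels:
                continue  # already gathered, or a LEVEL owned by another cluster
            members.add(q)
            stack.extend(predecessors.get(q, ()))
        clusters[level] = members
    shared = set()
    names = sorted(clusters)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            shared |= clusters[a] & clusters[b]
    return clusters, shared
```

Each quantity in shared is a candidate for the choice described above: keep the duplicates, or assign it to one cluster and send its value to the others.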

The Symbol Table, the Sequencing Module, and the Partitioning Module all
build and analyze data structures to determine dependency among statements.
Instead of a single complex data structure, each phase builds a separate
local data structure. A centrally located data structure requires
considerable accessing time while separate local data structures are more
efficient and simpler because they are tailored for the specific needs of
individual modules. Moreover, all three data structures are constructed and
analyzed simultaneously.

At compile time, a statement in its internal form can be transmitted
from one phase to another by point-to-point messages. The broadcast mode
is convenient for disabling succeeding phases when a fatal error is detected.

At runtime, clusters on all processors are synchronized to start
executing the same simulation cycle. The cycle synchronization signal can be
generated by a signaling process which emits a broadcast message to poll each
cluster's readiness to start a new cycle. Message transfers between clusters
within the same cycle are carried out asynchronously by point-to-point
messages. Readiness to start a new cycle implies completion of all
executions and arrivals of all necessary values, if any, from other clusters
for the next cycle.


Complex  Process  Control 


A major  purpose  of  the  TECHNEC  Project  is  to  use  our  network  computer 
in  developing  a coherent  style  for  controlling  large-scale  electronic, 
mechanical,  or  chemical  systems  in  which  so  many  variables  must  be  regulated 
that  real-time  computation  becomes  difficult  or  impossible  for  a conventional 
computer  of  affordable  size  and  speed.  In  general,  we  seek  ways  of 
delegating  responsibility  for  control  decisions  to  processors  in  a 
heterarchical  control  network,  so  that  a program  currently  assuming  an 
executive  role  will  not  be  overwhelmed  by  having  to  control  everything  at 
once. 


Skilled  engineers  use  a variety  of  tricks  and  know-how  to  make  such 
multivariable processes work. Well known examples show many ways in
which  complex  tasks  requiring  accurate  coordination  of  many  variables  are 
best  performed  by  distributed  controllers,  each  handling  a stage  of  rough 
computation,  to  avoid  overwhelming  an  executive  with  the  need  to  regulate  a 
large  number  of  variables  at  once.  Underlying  a large  collection  of  these 
seemingly  ad  hoc  engineering  tricks  for  making  complicated  biological  and 
artificial  systems  work  may  be  discerned^®, a coherent  style  for  controlling 
large-scale  systems,  in  which  authority  and  computations  are  delegated  to 
loosely  coupled  subsystems  that  can  fake  what  the  system  needs,  each, 
perhaps,  in  a limited  set  of  circumstances.  The  control  problem  consists 
in  coordinating  all  these  partial  solutions  into  a coherent  implementation 
of  the  solution  in  a wide  variety  of  circumstances.  Variety  of  response  is 
achieved  by  adjusting,  or  "tuning"  the  partial  solutions.  In  this  way,  the 
command  for  action  can  be  a simple  one  that  does  not  have  to  specify  the 
precise  variant  of  the  action  to  be  performed. 

The  partial  solutions  can  be  patterns  of  feedforward  that  maintain 
subsystems  close  enough  to  their  desired  configurations,  so  that  very  simple 
feedback  mechanisms  can  achieve  the  small  remaining  corrections.  This 
information-handling  style  reduces  the  staggering  load  on  a controller  that 
would  result  if  it  had  to  regulate  all  the  degrees  of  freedom  that  participate 
in  any  complex  action.  Through  such  simple  and  imprecise  stages  of 
preprocessing,  enough  of  the  generality  is  removed  from  the  world  seen  by  a 
regulatory system, so that a regulator can succeed that is much too simple
to  succeed  in  a general  world. The  specific  problems  of  implementing 
these  obvious  ideas  form  the  starting  point  for  our  theoretical  study  of 
heterarchical  task  organization,  which  is  a necessary  part  of  creating  an 
adequate  high-level  language  of  control  actions  (as  distinguished  from  the 
variabilities  of  individual  low-level  realizations  of  these  actions). 

This  style  particularly  suits  computers  that,  without  excessive 
computation,  must  regulate  actions  whose  individual  realizations  must  be 
infinitely  variable  (as  parameters  and  conditions  change),  under  the 
constraints of limited information and communication within a computer
network,  or  within  a large-scale  integrated  circuit  chip  whose  components  are 
not  easily  accessible  from  the  outside.  TECHNEC  will  enable  us  to  experiment 
with  this  style  of  control,  through  parallel  execution  and  control  of  the 
partial  solution  by  the  processors  in  the  network,  and  to  develop  software 
(compilers,  languages,  simulation  methods)  for  implementing  this  style. 

As an example of distributed control of a multivariable system, we are
simulating  a robot  arm.  We  program  with  a mind  to  applicability  of  our 
methods  to  the  general  kinds  of  control  tasks  that  would  be  found  in  process 
control,  reactors,  spacecraft,  etc.  The  arm  is  propelled  by  torques  applied 
to  its  joints.  These  impulses  produce  ballistic  ("freehand")  movements,  in 
which  a movement,  once  started,  is  continued  by  the  momentum  of  the  arm, 
rather  than  the  type  of  movement,  common  in  numerical  control  and  robot 
research,  in  which  the  arm  is  continuously  guided  through  a succession  of 
closely  spaced  points.  Thus,  we  seek  to  make  the  arm  follow  a desired  path 
by  subjecting  it  to  a small  number  of  simply  specified  inputs  (generated 
by  small  subroutines  or  simple  components),  rather  than  a large  number  of 
small  impulses,  each  requiring  a separate  specification.  Our  main  problem  is 
to  get  from  a description  of  the  desired  movement  to  a description  of  the 
necessary  inputs. 

The  trend  these  days  in  practical  and  experimental  robot  arms  is  to  use 
an  arm  having  the  minimum  number  of  degrees  of  freedom  (6)  needed  to  place 
the  hand  in  an  arbitrary  orientation  at  an  arbitrary  location.  The  reason  is 
a desire  to  minimize  computational  complexity.  However,  this  means  that,  to 
a given hand position there corresponds a unique configuration of all the
joints, so that a desired path in space calls for a trajectory in
joint-configuration space that (1) must be elaborately computed, and (2) is
almost surely unlike any trajectory achievable through free ballistic
movement, so that point-by-point control must be imposed.

In  contrast,  a human  arm  has  many  more  degrees  of  freedom  than  the 
minimum,  but  when  we  swing  our  arm  to  pick  something  up,  we  lock  most  of 
these  degrees  of  freedom,  and  use  a cleverly  chosen  small  subset  of  joint 
movements  to  produce  an  approximate  realization  of  the  desired  path  that 
can be achieved ballistically by a very small number of impulses. In this
way, the existence of so many degrees of freedom does not complicate
control in our style; rather, it allows the selection of small subsets of
these degrees of freedom for use in simply conceived and executed recipes
of  movement.  Thus,  what  you  see  as  one  arm,  conceptually  functions  as 
perhaps  100  different  "virtual  arms".  Each  of  these  virtual  arms  has 
simple  ways  of  doing  particular  kinds  of  movements.  Our  control  system 
should  maintain  a catalogue  of  movement  descriptions  with  pointers  to  their 
appropriate  virtual  arms  and  recipes  for  using  them. 

We are currently working on such a catalogue for a handball-like task,
subject to constraints of limited-precision observation and computation and
the need to operate in real time. As a prerequisite, we have implemented an
algorithm for simulating such an arm. Parallel processes in our network will
embody the catalogue (with storage and retrieval programs), virtual arm
"assembly", torque impulse generation, monitoring, and "demons" to notice
significant conditions requiring interrupts or activation of subroutines.

We have attempted to explain the decisions made in arriving at the
design  of  the  network  computer  being  constructed  at  Illinois  Institute  of 
Technology.  Design  alternatives  and  relation  to  other  interconnected 
computer  systems  composed  of  minicomputers  and/or  microcomputers  are 
examined.  The  network  architecture  and  software  facilities  are  presented 
briefly  with  emphasis  on  the  message  communication  mechanism.  Important 
strategies  to  utilize  such  a network  computer  are  enumerated  and  illustrated 
by  two  applications.  We  believe  that  such  a system  opens  new  areas  of 
application  on  a collection  of  loosely  coupled  microcomputers. 

ABSTRACT

The advent of the microprocessor once again raises the question of
programming suitability. This paper deals with the development of
semi-automatic generation of cross-compilers for writing high-level language
cross-compilers to support microcomputers. The techniques used are
the  translation  of  the  source  language  to  a macro  string  for  processing 
by  a macro  cross-assembler  to  produce  machine  dependent  object  code.  The 
automatic  production  of  cross-assemblers  is  discussed.  This  software 
should  provide  for  rapid  generation  of  cross-compilers  which  will  result 
in  greater  usage  of  high-level  languages  for  software  development  for 
microcomputers. 


*This  work  was  partially  supported  by  the  National  Science  Foundation  Grant 
DCR75-03578. 

INTRODUCTION

The large scale integrated technology of the seventies has created
a revolution in computing systems. First, there was the introduction of
the microprocessor [1] in 1971. These early and primitive multiple chip
CPUs were quickly followed in 1974 by the introduction of more sophisticated
eight bit microprocessors such as the Intel 8080, Signetics 2650,
Motorola M6800, etc., culminating in the current versions of these devices
plus many new ones. In addition to the LSI impact on CPU elements, the
same technology has produced the 16K dynamic RAM and the 4K static RAM,
resulting in 1977 memory prices of about $250 per 16K bytes. These prices
indicate that current generation microprocessors, which typically have a
direct addressing capability of 65K, can be fully memory configured for
approximately $1,000. In 1977, the CPU and memory chips were supplemented
with the introduction of peripheral processors for use as disk controllers,
keyboard controllers, etc. This will permit very low costs for data
entry peripherals and data storage peripherals such as the keyboard-CRT
terminal device and the floppy disk rotating mass storage device. LSI has
therefore produced the $4,000 to $5,000 8-bit microcomputer with a relatively
large amount of memory (65K) and with mass storage and input/output
facilities. Such is the current state of the hardware.

Unfortunately, software development is lagging severely behind hardware
development. Microcomputer software is following in the footsteps of
its minicomputer predecessor: first the development of relatively crude
resident assemblers, followed by small operating systems (monitors), and
then one or two relatively crude languages. This is illustrated in the
microcomputer by the emphasis on such languages as BASIC [2].

One innovation in microcomputer software has been the use of larger
machines as software development aids, that is, the large scale usage
of cross-system software. Cross-system software [3] can be highly
sophisticated because it takes advantage of the host machine's software
capabilities. For example, cross-assemblers need not be crude but can include
such features as are found on maximachine assemblers: macro facilities,
conditional assembly, large numbers of pseudo operations, etc. This allows
the maximachine to be used more effectively for software development
than the crude resident assemblers allow.
However, even many of the cross-system assemblers are little more than
mnemonic translators.

Intel  recognized  the  need  for  a systems  language  and  their  pioneering 
effort  with  the  PL/M  language  is  to  be  applauded.  This  language  is  used 
in-house  for  nearly  all  of  Intel's  software  development.  The  first  versions 
of  PL/M  were  cross-compilers  for  the  8008  and  8080.  PL/M  has  subsequently 
been imitated by other vendors to provide cross-compilers for their
microprocessors. Examples are PLμS, MPL, and others.

If any one characteristic attaches to programs run on microcomputers,
it is that they are primarily systems programs.
That is, they are programs with long life; they are well debugged, perhaps
quite complex, and often directly concerned with hardware. Moreover,
microcomputers are often devoted to a single task: for example, serving as a
stand-alone BASIC or APL system in an educational environment, acting as a
data acquisition system, driving graphics displays, or doing real-time
laboratory analysis. In many instances, the task occupies nearly the full
resources of the machine. However, there are many more where processing time
is available but unused. The fact that most of these programs are systems
oriented is evidenced by the software that has been developed for these
machines. Therefore, there is certainly a need for high-level languages
to support system software development.

But,  the  real  problem  is  that  software  must  be  generated  from  scratch 
for  each  new  microprocessor  introduced  into  the  market.  The  large  number 
of  microprocessors  has  made  it  paramount  that  new  techniques  be  developed 
for the implementation of high-level languages for the numerous
microprocessors that already exist and the innumerable ones that will exist in the
future.  New  software  techniques  are  necessary.  The  GEN  [4,  5]  system 
developed  at  Colorado  State  University  was  an  attempt  to  automate  the 
production  of  microcomputer  cross-assemblers  and  simulators.  The  same  is 
required  for  high-level  languages. 

The compilation problem has been carefully analyzed and has now been
reduced to the problem of automatic code generation, which is what is
needed in order to produce compilers automatically. This remains an unsolved
problem, although a number of references [6-9] describe it.
It is the intent of the present author to pursue the development of
high-level language cross-compilers using sophisticated cross-assemblers with
macro and conditional assembly facilities in order to generate object
code for different microprocessors.

Before  proceeding  to  the  discussion  of  the  high-level  language 
cross-compiler,  the  next  section  briefly  introduces  the  ASM/GEN  system 
for  automatically  producing  cross-assemblers  for  microcomputers. 

DESCRIPTION OF ASM/GEN


ASM/GEN is a cross-assembler generating system for small computers [10].
The concept behind the program development was to construct a software
system that would produce cross-assemblers from a user-provided machine
description. ASM/GEN consists of a single generator routine which generates
the target assemblers from two modules of pre-written code (referred to as
"skeleton decks") provided with the package and the user-provided input.
The generated cross-assemblers are highly modular in their structure so
as to make ad hoc modifications a more straightforward process.

Other attempts to solve this problem have been made using a somewhat
different approach. The approach was to develop a generalized (or
universal) assembler called a meta-assembler [11-14]. The meta-assembler is
basically a machine independent cross-assembler which takes both the
assembly source program and the machine dependent specifics as part of
the input for each run.

The  syntax  of  a meta-assembly  program  is  characteristic  of  a typical 
assembly  program.  It  includes  a label  field,  op-code  mnemonic  field, 
operand  mnemonic  fields,  and  provision  for  typical  pseudo  directives.  In 
addition,  the  programmer  includes  meta-directives  stating  the  word  size  of 
the computer, number representation, and other machine specific
descriptions. The machine instruction set syntax is described via a FORMAT
directive. The FORMAT directive includes an identifier, the expressions
to be assembled into object code, and the length and relative ordering
of  the  expressions.  The  programmer  references  a given  format  by  writing 
its  identifier  in  the  op-code  (or  command)  field,  and  the  expressions  to 
be  translated  in  the  operand  fields. 

Another meta-directive is the PROCEDURE directive. The procedure
has one or more identifiers and is referenced by writing any of them in
the command field. Procedures can be used to generate one or more
instructions and might be aptly described as a generalized macro facility.


Although  both  ASM/GEN  and  the  meta-assembler  solve  essentially  the 
same  problem,  the  following  differences  should  be  noted: 

1. ASM/GEN requires a description of the machine dependent
   data only in the creation run of the generated assembler,
   while the meta-assembler requires this information to
   be part of each assembly; and

2. ASM/GEN generates an assembler for a specific machine
   which then can be used repeatedly for the assembly
   translation process, while the meta-assembler itself
   accomplishes the assembly translation in a universal
   fashion permitted by the inclusion of the machine
   specific descriptions.

The  most  important  consideration  in  ASM/GEN  is  the  correctness  and 
quality  of  the  generated  assemblers.  The  philosophy  taken  was  to  provide 
one standard set of assembly statement syntax conventions, constant
specifications, symbol and macro table management schemes, and all pseudo
instructions.  There  are  several  advantages  and  at  least  one  disadvantage 
to  this  procedure.  On  the  positive  side,  it  most  importantly  minimizes 
the  required  user  input  and,  thus,  greatly  reduces  the  assembler  generating 
effort.  Note  that  this  is  accomplished  by  consolidating  all  scanners, 
table  management  routines,  pseudo  instruction  processors,  etc.  into  the 
skeleton  decks. 

Another  important  advantage  is  in  the  standardization  effect. 
Typically,  microcomputer  systems  programmers  may  be  constantly  working 
with  a variety  of  microcomputers  due  to  rapidly  changing  technologies  and 
the  phenomenal  diversity  the  market  offers.  Hence,  if  the  ASM/GEN  system 
is used to generate all assemblers, they will have a great deal in common.
Those  who  have  had  to  work  in  several  computer  languages  concurrently  can 
appreciate  the  advantage  offered  here  by  the  standardized  syntax  and 
pseudo  instructions. 

One  disadvantage  of  this  scheme  lies  in  the  inability  of  ASM/GEN 
assemblers to process vendor specified assembly code. This lack of
portability can be inconvenient and should be understood by those using the
ASM/GEN system to avoid misguided and unnecessary expenditures of human
energy! The point here is that the ASM/GEN system can build a well
facilitated cross-assembler for a given microprocessor but not the vendor's
assembler.

In  order  to  generate  a cross-assembler  for  any  target  microprocessor, 
the  user  provides  to  ASM/GEN  as  input  the  opcode  and  operand  mnemonics, 
their  binary  translations,  and  the  bit  position  information  for  a specific 
machine.  This  is  used  with  the  invariant  sections  of  the  assembler  to 
generate  a complete  ANSI  FORTRAN  IV  macro  assembler  in  four  distinct  phases 
which  are  discussed  below. 

In the first phase, all invariant code (a collection of subroutines
and functions) is transferred to the file that will contain the complete assembler.
This  code  accounts  for  about  eighty  percent  of  the  entire  assembler.  The 
variant  code  consists  of  the  routines  which  process  assembly  translation 
of  the  machine  instructions  on  a class-wise  basis  (as  described  in  [10]), 
a routine  which  directs  the  assembler  to  the  proper  translation  class 
module  for  processing,  and  the  complete  data  definition  of  all  symbol 
table  entries. 

Phase  one  is  accomplished  by  using  a skeleton  deck  as  standard  ASM/GEN 
input.  The  skeleton  deck  includes  all  invariant  program  sections  and  a 
marker  to  flag  the  necessity  for  ASM/GEN  intervention.  A dollar  sign  in 
column one was used as the marker since it is an invalid FORTRAN usage.
The phase one action can be described as

    read card image (from skeleton file [SKEL1]);
    repeat while image [SKEL1] ≠ $
        begin
            write card image (to assembler file);
            read card image (from skeleton file [SKEL1]);
        end

using  an  Algol-like  language.  In  addition,  the  first  user  input  card  is 
read  and  stored.  This  card  contains  the  number  of  translation  classes  and 
the  word  size  of  the  machine  in  bits. 
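
The copy loop of phase one can also be written as a short Python sketch. It is an illustration only: the function name `copy_until_marker` and the use of in-memory text streams in place of card files are assumptions, and stopping at end of file is a safeguard added here that the Algol-like description does not mention.

```python
import io

MARKER = "$"  # a dollar sign in column one flags the need for generator intervention


def copy_until_marker(skeleton, assembler):
    """Copy card images from the skeleton file to the assembler file.

    Copying stops at the first marker card (or at end of file), leaving the
    skeleton positioned just after the marker for the next phase.
    """
    copied = 0
    for card in skeleton:
        if card.startswith(MARKER):
            break
        assembler.write(card)
        copied += 1
    return copied


skeleton = io.StringIO("      SUBROUTINE SCAN\n      RETURN\n$\n      END\n")
assembler = io.StringIO()
copy_until_marker(skeleton, assembler)
```

Because the marker can never begin a valid FORTRAN card, the test on column one cannot misfire on invariant code.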

Phase  two  is  the  translation  class  routine  builder.  It  reads  the 
invariant  statements  of  a class  routine  from  skeleton  file  [SKEL2]  and 
uses  the  same  marker  scheme  used  in  phase  one.  The  user  input  for  each 
translation class consists of information characteristic of all instructions
in a given class, followed by the mnemonics and hexadecimal equivalents
for each machine instruction in the class. The characteristic
information  includes  the  class  identification  number,  the  maximum  number 
of  op-code  and  operand  mnemonic  fields  allowable,  the  number  of  object 
code  words  generated  per  instruction,  the  internal  bit  position  where  the 
binary  equivalent  of  each  field  is  to  be  placed,  and  the  size  (in  bits)  of 
each  field  in  the  object  code  word. 

The  internal  bit  position  of  each  field  is  specified  via  the  imaginary 
word  formed  by  concatenating  all  object  words  together  in  a left-to-right 
fashion.  For  example,  if  there  are  x bits  per  word  and  y words  of  object 
code  generated  for  all  instructions  in  the  given  class,  then  there  are 
z = x*y  bits  in  the  imaginary  word.  The  user  specifies  the  right-most  bit 
position  of  each  field  as  it  would  be  in  the  imaginary  word.  Bit  z-1  is 
assumed to be the left-most bit of the first object word, and bit 0 is
assumed  to  be  the  right-most  bit  of  the  last  object  word. 
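
A small Python sketch may make the imaginary-word convention concrete. The function name `pack_fields`, the field values, and the choice to OR overlapping fields together are assumptions made for illustration; the paper does not give the generated FORTRAN code for this step.

```python
def pack_fields(bits_per_word, n_words, fields):
    """Place fields in the imaginary word and split it into object words.

    fields is a list of (value, rightmost_bit, size_in_bits) tuples, where
    rightmost_bit is counted in the imaginary word of z = x*y bits.
    """
    z = bits_per_word * n_words
    imaginary = 0
    for value, rightmost_bit, size in fields:
        if rightmost_bit + size > z:
            raise ValueError("field does not fit in the imaginary word")
        mask = (1 << size) - 1
        # Assumption: fields that overlap are combined with a bitwise OR.
        imaginary |= (value & mask) << rightmost_bit
    word_mask = (1 << bits_per_word) - 1
    # Bit z-1 is the left-most bit of the first object word, so the most
    # significant word comes first.
    return [(imaginary >> (bits_per_word * i)) & word_mask
            for i in range(n_words - 1, -1, -1)]


# Hypothetical two-word instruction on an 8-bit machine (x = 8, y = 2, z = 16):
# an op-code and a 2-bit register field in the first word, a displacement in
# the second.
words = pack_fields(8, 2, [(0x90, 8, 8), (0x1, 8, 2), (0x05, 0, 8)])
```

Here the op-code and register fields both have their right-most bit at position 8, so they land in the first object word, while the displacement at position 0 fills the second.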


The  last  card  of  each  translation  class  input  group  is  END.  This 
signals  ASM/GEN  to  rewind  the  skeleton  [SKEL2]  file  and  begin  the  process 
over  for  the  next  translation  class  to  be  defined.  A special  case  is 
the  specification  of  the  general  mnemonics  used  in  the  operand  fields  or 
in  expressions  which  are  not  to  be  confused  with  op-code  mnemonics.  These 
are  designated  by  using  class  identification  number  0,  followed  by  each 
mnemonic  and  its  binary  equivalent.  As  with  all  other  translation  class 
input  groups,  it  is  terminated  by  an  END  card.  The  entire  input  deck  is 
terminated  by  detection  of  an  end  of  file  card  immediately  following  an 
END  card. 

Phase  three  begins  after  all  user  input  has  been  read  and  the  symbol 
table  entries  for  all  mnemonics  and  their  binary  equivalents  constructed. 
The  purpose  of  phase  three  is  to  construct  the  routine  which  maps  a 
given  op-code  mnemonic  into  its  class  value  and  then  branches  to  that 
translation  class  routine.  Mapping  is  done  via  the  symbol  table  and  the 
branch is accomplished using a computed GO TO statement to a sequence of
CALL  statements,  one  for  each  translation  class  routine. 
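
In a language with first-class functions, the same mapping and branch can be sketched as a table lookup followed by an indexed call. The mnemonics, class values, and routine names below are hypothetical; the list index plays the role of the computed GO TO index.

```python
def translate_implied(operands):
    return ("implied", operands)


def translate_memory_reference(operands):
    return ("memory reference", operands)


# Index = class value, mirroring the sequence of CALL statements.
# Class 0 is reserved for general operand mnemonics, so it has no routine.
CLASS_ROUTINES = [None, translate_implied, translate_memory_reference]

# Hypothetical symbol table entries: op-code mnemonic -> class value.
SYMBOL_TABLE = {"HALT": 1, "JMP": 2}


def dispatch(mnemonic, operands):
    class_value = SYMBOL_TABLE[mnemonic]           # mapping via the symbol table
    return CLASS_ROUTINES[class_value](operands)   # the "computed GO TO"
```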

The  final  phase  writes  out  all  data  definitions  in  the  BLOCK  DATA 
routine.  The  data  includes  the  class  mapping  function  table,  all  reserved 
symbol  table  entries,  the  machine  specific  constants,  and  all  pointer 
initializations  for  the  assembler.  Thus,  the  completed  macro  assembler  is 
left  on  a disk  file  for  subsequent  use. 

The  generated  assembler  operates  on  a two-pass  basis  to  allow  simple 
resolution  of  forward  label  referencing.  The  first  pass  reads  the  source 
statements,  translates  them,  and  writes  out  resulting  information  to  an 
intermediate file. When a forward reference is detected, the entire source
statement is written to the intermediate file. A code is used to
distinguish between the two. Also, all error messages are recorded on an
intermediate  error  file  by  using  an  error  number  and  the  statement  number 
where  it  was  detected.  In  addition,  an  error  flag  is  set. 

The second pass reads the intermediate file and transfers the
information accumulated in the first pass to the listing and load block routines.
Statements with forward references are translated and the resulting
information is transferred to the listing and load block routines. Any errors
detected in pass two are written to a second intermediate error file in
the manner discussed above. When the end of the intermediate file is
detected, the error flag is checked. If true, an error merge routine is
invoked, merging and ordering error records by statement number. The
error listing routine is called next and uses the error numbers as
computed GO TO indices to print out intelligible diagnostics. Finally, the
symbolic reference map flag is checked and, if true, the reference map is
printed, concluding the assembly.
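
The two-pass scheme can be illustrated with a toy Python sketch. Everything specific here is an assumption made for the example: statements are (label, op-code, operand) triples occupying one word each, the record codes "T" (translated) and "S" (source statement) stand in for the distinguishing code, and the error numbers are invented.

```python
def translate(op, operand, symbols):
    value = int(operand) if operand.isdigit() else symbols[operand]
    return (op, value)


def pass_one(statements):
    symbols, intermediate, errors = {}, [], []
    for number, (label, op, operand) in enumerate(statements):
        if label:
            if label in symbols:
                errors.append((1, number))      # error 1: duplicate label
            else:
                symbols[label] = number
        if operand.isdigit() or operand in symbols:
            intermediate.append(("T", number, translate(op, operand, symbols)))
        else:
            # Forward reference: keep the source statement for pass two.
            intermediate.append(("S", number, (op, operand)))
    return symbols, intermediate, errors


def pass_two(symbols, intermediate):
    listing, errors = [], []
    for code, number, payload in intermediate:
        if code == "S":
            op, operand = payload
            if operand not in symbols:
                errors.append((2, number))      # error 2: undefined symbol
                continue
            payload = translate(op, operand, symbols)
        listing.append((number, payload))
    return listing, errors


def assemble(statements):
    symbols, intermediate, errors_one = pass_one(statements)
    listing, errors_two = pass_two(symbols, intermediate)
    # Merge the two error files, ordered by statement number.
    errors = sorted(errors_one + errors_two, key=lambda e: e[1])
    return listing, errors


program = [(None, "JMP", "END"), ("LOOP", "LDI", "7"),
           (None, "JMP", "LOOP"), ("END", "JMP", "NOWHERE")]
listing, errors = assemble(program)
```

In this sketch the first statement is deferred to pass two because END is not yet defined, while the last statement produces an undefined-symbol error record.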

The  complete  assembler  contains  most  frequently  needed  facilities 
found  in  popular  contemporary  assemblers.  They  include  the  ability  to 
reference  the  program  counter  value  (PC),  free  formatting  with  colons 
delimiting  labels,  blanks  or  commas  delimiting  op-code  and  operand  mnemonic 
fields,  a semi-colon  for  operational  end  of  statement  delimiting,  multiple 
labels on a single card, PC value assignment (*ORG), PC value increment
(*DS), ASCII text string specification of arbitrary length (*TXT), label
equivalence (*EQU), definition and assignment of special arithmetic type
symbols (*DEFA, *SETA), use of +, -, *, / in arithmetic expressions, macro
definition (*MACRO, *MEND), embedded and recursive macro generation,
conditional assembly with the standard IF, THEN, ELSE control structure,
ability to specify a constant using any radix r, 2 ≤ r ≤ 16, a listing
title designator (*TITLE), an extensive cross reference map (*CREF), and
extensive error diagnostics.

ASM/GEN exists as both a batch-oriented system and an interactive
version. The prerequisite knowledge required of ASM/GEN users is fairly
minor. The assembler generating process is a straightforward procedure
which consists of two basic phases. They are: (1) the classification of
machine instruction mnemonics, and (2) inputting the field specifications,
the mnemonics and their corresponding hexadecimal values. Most vendor
literature is organized in this manner so that converting the necessary
input data to ASM/GEN format typically takes less than 30 minutes.
Cross-assemblers with macro and conditional assembly features can be generated
in less than one hour for any small machine.

Table  I presents  the  necessary  input  information  for  the  cross- 
assembler  for  the  National  SC/MP  as  an  illustrative  example. 

    ORI   DC            END
    XRI   E4            5 2 1
    DAI   EC            0 0
    ADI   F4            8 2
    CAI   FC            XPAL  30
    DLY   8F            XPAH  34
    END                 XPPC  3C
    3 3 2               END
    8 8 0
    8 2 8
    JMP   90
    JP    94
    JZ    98
    JNZ   9C
    ILD   A8
    DLD   B8
    END

TABLE I

MACRO-BASED HIGH-LEVEL LANGUAGES


Compiler technology, like the previously discussed assembler technology,
also recognizes machine invariant and machine variant code. This
has been pointed out by numerous authors. Lexical analysis, syntactical
analysis, and language parsing are well understood, and machine independent
algorithms have been developed for them. However, in the area
of code generation and code optimization, machine dependencies play a
major role. As stated earlier, automatic code generation would seem to
be the answer to the compiler problem, particularly for microprocessors,
where it would be nice to have the same language features supported by
a number of different commonly used machines. Bunza [6] has indicated
a number of different schemes that could be used for code generation.
These include macro expansion, hierarchical macro interpretation, pseudo
machines and abstract models, interpretation of abstract machine code
and special executive calls, code to code translation, and the operator
data base and data classification scheme that he introduces himself. It
would seem that, with the availability of cross-assemblers with full macro
and conditional assembly features, the first four categories would offer
an avenue of study for high-level language compilers. These are: macro
expansion, hierarchical macro interpretation, pseudo or abstract machine
models, and interpretation of abstract machine code.

At Colorado State University, three parallel projects are presently
under investigation. All three of these approaches utilize the macro
capability of ASM/GEN. The projects are increasingly complex. The first
project uses the macro processor as a pseudo high-level language. It
allows the programmer to program in macros rather than in machine language
instructions. Berry and Johnson [15] developed a macro based language
for programmable data acquisition systems. The language is called Data
Reduction Language (DRL).

DRL  was  developed  to  essentially  automate  the  process  of  invoking 
all  the  correct  subroutines  to  perform  a set  of  operations  on  variables 
and  constants.  This  is  accomplished  primarily  through  conditional  assembly 
and  macro  expansion. 

DRL currently has a limited instruction set (six basic instructions)
which includes four arithmetic type instructions and two register modifiers.
The  six  basic  instructions  are:  LOAD,  MOVE,  ADD,  SUB,  MULT,  DIVIDE.  This 
limited  vocabulary,  however,  is  still  quite  powerful  as  will  be  illustrated 
with  a typical  user  program.  Table  II  gives  a brief  symbolic  description 
of  the  consequence  of  each  instruction. 


    Name      Parameters             Action

    LOAD      VARIABLE               AC ← VARIABLE
    MOVE      DESTINATION, SOURCE    DESTINATION ← SOURCE
    ADD       VARIABLE               AC ← AC + VARIABLE
    SUB       VARIABLE               AC ← AC - VARIABLE
    MULT      VARIABLE               AC ← AC * VARIABLE
    DIVIDE    VARIABLE               AC ← AC / VARIABLE

TABLE II: DRL INSTRUCTIONS

The variable described in Table II may be one of the following two:

1)  Constant. A constant is defined at assembly time and
occupies memory locations in a PROM.

2)  System Parameter. This variable is essentially a channel
number on the A/D converter. LOAD SYSPARM will invoke
a call to the digitizing routine at the appropriate
channel number and then scale the input according to
a precedence for that class of signal (e.g., temperature
conversion from °K to °F, normalization of incident energy
into langleys, etc.).

A typical user data reduction phase might consist of measuring the
energy collected by a solar collector by summing the incremental energy
flux absorbed in the working fluid, obtained from an $\dot{m} C_p \Delta T$
calculation. Let us further assume that a linear model of $C_p$ versus
$T$ has been constructed:

$C_p = 0.7236\ \mathrm{Btu/(lb\,^{\circ}F)} + 0.0006188\ \mathrm{Btu/(lb\,^{\circ}F^2)} \cdot T(^{\circ}\mathrm{F})$

This is valid for a 60/40 (water/ethylene glycol) mixture. The energy
flux is then:

$E_i = \dot{m}\, C_p\, (T_{c,out} - T_{c,in})$    (per minute)

Summing for some number of intervals k and letting the time interval be
one minute yields:

$E = \sum_{i=1}^{k} E_i$    (1)

To  implement  this  reduction  in  DRL,  one  must  first  clear  EPSUM,  the  energy 
partial  sum.  This  may  be  done  by  defining  a constant  0.0  in  EPROM  and 
moving  this  constant  to  EPSUM: 

MOVE  EPSUM  ZERO 


The DRL to implement equation (1) is then:

    LOAD  TCOUT      ; GET TEMP AT COLL OUTPUT
    SUB   TCIN       ; SUBTRACT TEMP AT INPUT
    MOVE  R1 AC      ; SAVE DELTA T IN R1
    LOAD  TCOUT      ; GET COLL TEMP FOR CP CALC
    MULT  .0006188   ; MULTIPLY BY SLOPE
    ADD   .7236      ; ADD INTERCEPT
    MULT  R1         ; MULTIPLY BY DELTA T
    MULT  CMDOT      ; MULTIPLY BY COLL MASS FLOW RATE
    ADD   EPSUM      ; ADD THIS FLUX TO TOTAL
    MOVE  EPSUM AC   ; RESTORE ANS AS NEW EPSUM

The actual macro expansions of these DRL statements are shown on the
next two pages. This language allows for rapid development of software
for the programmable data acquisition system. This particular system is
built around the MOS Technology 6502 microprocessor, but it could be
rewritten for any microprocessor. Current plans are to extend the macro
set to include a control structure so that the language can be used to
write process control algorithms as well as data acquisition and
processing functions.
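The data reduction performed by the DRL program above can be mirrored step by step in a few lines of Python. This is a sketch for checking the arithmetic only; the function names and the list-of-samples interface are assumptions, and the comments indicate which DRL statements each line corresponds to.

```python
SLOPE = 0.0006188   # slope of the linear Cp model, Btu/(lb °F^2)
INTERCEPT = 0.7236  # intercept of the linear Cp model, Btu/(lb °F)

def energy_flux(t_out, t_in, m_dot):
    """Energy flux for one interval, following the DRL statement order."""
    delta_t = t_out - t_in            # LOAD TCOUT / SUB TCIN / MOVE R1 AC
    cp = t_out * SLOPE + INTERCEPT    # LOAD TCOUT / MULT slope / ADD intercept
    return cp * delta_t * m_dot       # MULT R1 / MULT CMDOT

def accumulate(samples):
    """Sum the flux over all intervals; samples are (t_out, t_in, m_dot)."""
    epsum = 0.0                       # MOVE EPSUM ZERO
    for t_out, t_in, m_dot in samples:
        epsum += energy_flux(t_out, t_in, m_dot)  # ADD EPSUM / MOVE EPSUM AC
    return epsum
```

Note that, as in the DRL listing, the specific heat is evaluated at the collector output temperature rather than at a mean fluid temperature.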

The second project uses Halstead's [16] PILOT language, which has
been rewritten as a FORTRAN cross-compiler. The language has been
slightly modified and redesignated the micro PILOT language, µPL. The
first pass over the µPL program converts the source code into a macro
string list. This program consists of approximately 800 FORTRAN card
images and took approximately two man-weeks to implement. The compiler
itself consists of two basic blocks: the lexical scanner and analyzer,
and the parser, which uses a transition table parsing scheme based upon
the current operator/next operator pair. Although the language is
rather simple, µPL code is far more readable than typical assembly
language mnemonics, as illustrated by the subroutine for a bubble sort
shown in Figure 1.

The output macro string is made up from the following 17 macros: the
program halt macro, STOPIT; the indexed addressing macro for handling
subscripted variables, INDEXIT; the unconditional jump macro, BRANCH;
the subroutine call macro, CALL SUBROUTINE; the load register macro,
LOADIT; the output macro, WRITEIT; the input macro, READIT; the return
from subroutine macro, RETURN; the arithmetic macros, ADDIT,
SUBTRACTIT, MULTIT, and DIVIDEIT; the store register to memory macro,
STOREIT; the equality comparison macro, EQUAL TO; the less-than
comparison macro, LESS THAN; the data define and store macro, DEFINE
STORE; and the open subroutine macro, SUBROUTINE.

A macro map of these 17 macros to target machine code completes the
process. The macro and conditional assembly features of ASM/GEN are
used for this pass. At the current time, the only operational program
translates to 8080 machine code. The requirement here is that
definitions for each of the macros produced by the language be
specified, so that an individual can provide a macro map for any
required microprocessor. This macro library is then attached to the
ASM/GEN generated cross-assembler, providing a high-level language for
the specified microprocessor. Figure 2 illustrates the entire process.
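The idea of a macro map can be sketched as a table that expands each intermediate macro into target assembly text. The paper does not give the ASM/GEN macro definitions, so the map below is hypothetical: it covers only six of the 17 macros, the operand conventions are assumed, and the expansions use standard 8080 mnemonics (LDA, STA, JMP, CALL, RET, HLT) chosen for illustration.

```python
# Hypothetical macro map for an 8080 target: macro name -> line templates.
MACRO_MAP_8080 = {
    "LOADIT": ["LDA {0}"],            # load accumulator from memory
    "STOREIT": ["STA {0}"],           # store accumulator to memory
    "BRANCH": ["JMP {0}"],            # unconditional jump
    "CALL SUBROUTINE": ["CALL {0}"],  # subroutine call
    "RETURN": ["RET"],                # return from subroutine
    "STOPIT": ["HLT"],                # program halt
}

def expand(macro_string, macro_map):
    """Expand a list of (macro_name, operand, ...) tuples into assembly lines."""
    lines = []
    for name, *operands in macro_string:
        if name not in macro_map:
            raise KeyError(f"no definition for macro {name!r} on this target")
        lines.extend(t.format(*operands) for t in macro_map[name])
    return lines
```

Retargeting then amounts to supplying a different map with the same macro names, which is the property the text relies on.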

Currently, the language is being extended to include boolean operators.
The macro definitions are also being documented. In addition, the
cross-compiler is being rewritten in µPL so that the compiler can
compile itself. A simple resident macro assembler is also being written
in µPL. This will eventually provide resident µPL language capability
on various microprocessors.
[Listing: 6502 macro expansions of the DRL statements MOVE R1 AC, LOAD
TCOUT, MULT C0006, ADD C7236, MULT R1, MULT CMDOT, and ADD EPSUM. Each
expansion consists of an indexed byte-copy loop ending in DEX and BPL,
together with JSR calls to support routines including DIGITIZ, TEMP,
PRES, SWAP, FMUL, and FADD.]


    SORT:
        I <- 1;
        NTEST <- N-2;
    DO WHILE SWITCHED:
        SWITCHED <- 0;
    INNER LOOP:
        I <- I+1;
        K <- I+1;
        A[I] < A[K] ? INNER LOOP.;
        SWITCHED <- 1;
        TEMP <- A[I];
        A[I] <- A[K];
        A[K] <- TEMP;
        I < NTEST ? INNER LOOP.;
        SWITCHED = 1 ? DO WHILE SWITCHED.;

Figure 1: µPL subroutine for a bubble sort
Figure 2: Use of GEN System Software


The µPL language is readily extendable in some ways by simply
increasing the size of the parser's transition table. However, it is
basically a simple language: all variables are global, there are no
looping constructs, it is not block structured, etc. To overcome this
inherent simplicity, a more complex systems language is also being
worked on. This language is PASCAL S, the student subset of PASCAL.

Two approaches are under investigation. The first is identical to the
µPL procedure. That is, a macro library is constructed for processing
the macro string intermediate text. The second approach uses a pseudo
machine interpreter. These two techniques should enable us to determine
necessary sets of macros and natural pseudo machine architectures for
supporting high-level language development. If these techniques prove
out, it should be possible to quickly generate cross-compilers and,
ultimately, resident compilers for any microprocessor.
ABSTRACT

Raster scan computer graphics displays require the image generated
by the program to be converted to raster scan order. This is almost
always done using a large frame buffer memory. An alternative
technique is to substitute for the frame buffer memory enough
processing power to perform the conversion "on the fly" for each frame.
It appears that microprocessors can now provide this processing power
at low cost. One possible implementation of such a raster-scan
conversion algorithm is presented which uses one LSI microprocessor and
one small special purpose processor running concurrently in a pipelined
fashion. With today's microprocessor technology, this approach is
shown to be feasible and its economics compare favorably with a frame
buffer system of similar performance.


This report was prepared as a result of work performed under NASA
Contract No. NAS1-14101 while the author was in residence at ICASE,
NASA Langley Research Center, Hampton, VA 23665.


RASTER-SCAN  CONVERSION  USING  CONCURRENT  MICROPROCESSORS 


I.  INTRODUCTION 

Computer graphics display devices can be classified as either
random or raster scan devices. Random scan devices allow the image to
be drawn on the display in any order generated by an application
program. For line drawings, this is often specified by a list of
vector endpoints. Raster scan devices are constrained to display the
image according to some specific order, usually left to right along
horizontal scanlines. Therefore, before displaying an image generated
by a program, the data must somehow be sorted so that it is available
to the raster display device in the proper order. This raster scan
conversion process poses an added complexity for raster scan displays.
However, once this conversion is accomplished, raster scan displays
have some advantages over random scan.

First, constraining the deflection of a CRT display to follow a
fixed raster-scan pattern considerably simplifies the analog deflection
electronics needed. Ordinary television receivers are raster scan CRT
devices. Compared to available random scan computer display devices,
they are inexpensive, provide color, and require much less adjustment
of the analog deflection circuitry. They are also widely used.

Second, the time required to display one frame with raster scan
devices is a constant (usually 1/30 sec). Random scan devices,
however, usually require a display time roughly proportional to the
total length of all vectors being displayed. Complex pictures may
require too much time to display (over 1/30 sec) and therefore appear
to flicker. On raster scan devices an arbitrarily complex image can be
displayed without flicker provided it is specified within the
resolution limits of the raster scan display. Thus raster scan devices
are usually used when it is desirable to display surfaces, which
require considerably more displayed vector length than corresponding
line drawings. However, raster scan devices suffer from a "stair
stepping" effect when used to draw lines that are neither horizontal
nor vertical. Random scan devices do not suffer from this effect.

II.  APPROACHES  TO  RASTER-SCAN  CONVERSION 

A frame buffer memory is almost universally used to accomplish the
raster-scan conversion process. One word of this memory is assigned to
each resolvable (x,y) position on the display so that increasing
addresses scan the display screen in the raster-scan order. The
contents of any one word of this memory specify the intensity/color of
the associated position on the screen. The frame buffer memory is
loaded in any order required by the program with intensity/color
information describing an image. The raster-scan output is then
produced by scanning the memory sequentially from its lowest to highest
address. The operation performed here is actually a "bucket sort" with
each word of the frame buffer being one bucket capable of holding one
datum.
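The view of a frame buffer as a bucket sort can be made concrete with a short sketch. The code below is illustrative only; the function name and the row-major address formula `y * width + x` are assumptions consistent with "increasing addresses scan the display in raster-scan order".

```python
def frame_buffer_scan(width, height, writes, background=0):
    """Load (x, y, intensity) triples in any order, then read in raster order.

    Each word of the buffer is one bucket holding one datum; a later write
    to the same position simply replaces the earlier one.
    """
    buf = [background] * (width * height)
    for x, y, intensity in writes:
        buf[y * width + x] = intensity   # random-access write, one word at a time
    return buf                           # sequential read-out is raster order
```

Whatever order the program writes in, reading the list from index 0 upward yields pixels left to right, top to bottom.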

Using a frame buffer memory with high resolution or many
intensity/color levels requires much memory. For example, a 512x512
resolution with 512 intensity/color combinations requires
512x512x9 = 2,359,296 bits of memory. Also, to provide real time
motion, each consecutive frame may display different images. For this,
the memory speed must be fast enough to allow a new image to be written
into the frame buffer within one frame time (1/30 sec). This writing
time is a function of the complexity of the picture. In general,
writing must proceed one word at a time since access is random. Of
course, if real-time motion is not desired, the frame buffer can be
filled slowly, after which it can be displayed for many frames.
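The memory figure quoted above, and the average time available per write when every word must be rewritten in one frame, follow from simple arithmetic. The helper names below are assumptions introduced for this sketch.

```python
def frame_buffer_bits(width, height, bits_per_pixel):
    """Total bits when one word is assigned to each resolvable (x, y) position."""
    return width * height * bits_per_pixel

def write_time_budget_ns(words_written, frames_per_sec=30):
    """Average time per random-access write if all writes fit in one frame."""
    return 1e9 / (frames_per_sec * words_written)

bits = frame_buffer_bits(512, 512, 9)          # 2,359,296 bits, as in the text
worst_case_ns = write_time_budget_ns(512 * 512)  # every word rewritten each frame
```

Doubling the resolution in both directions quadruples `frame_buffer_bits`, which is the quadratic growth discussed in the following paragraphs.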

An alternative approach is to perform the raster conversion process
by sort techniques that do not require a large memory. With this
approach, enough processing power is required for the entire sort to be
done "on the fly" for each frame, even for systems without real-time
motion. This approach therefore would seem to be useful for systems
which need real time motion. Also, the complexity of a moving image
that can be handled in real time is now determined by the speed of this
processing rather than by the speed of the frame buffer memory. Using
the approach described below, the speed of the processors must grow
linearly with resolution to display an image of a given complexity.
Frame buffer memory size and speed grow as the square of the
resolution. Thus the processor sort approach appears useful for
systems requiring high resolution. Jordan and Barrett [1] proposed one
such conversion algorithm for line drawings. Earlier, Erdahl [2]
described the design of hardware for executing the last portion of the
scan conversion process for surface drawings. More recently, Meyer [3]
has reported on a system in operation with hardware for this purpose.
This hardware, made by Staudhammer and associates [4], generates video
in real time from run length encodings of the images.

Large frame buffers for high resolution displays are becoming
economically feasible due to the dropping cost of memory. However,
processor cost is also dropping with the advent of inexpensive
microprocessors. In the following section, a method of raster-scan
conversion is presented which could be implemented using several small
processors running concurrently. Such a system would be capable of
moderate resolution and real time motion of moderately complex images.
It is argued that the processors now becoming readily available have
the capability of performing the raster scan conversion process and
that, because of their cost relative to memory costs, this approach
currently compares favorably with a frame buffer system of similar
parameters.

III.  RASTER-SCAN  CONVERSION  PROCEDURE 

In order to show the feasibility of using concurrent processors to
implement the approach described in this section, we will assume
reasonable resolution and picture complexity parameters, determine the
required processing speeds, and then present one possible design using
these processors that could be implemented from readily available
microprocessor components. Finally we make a comparison of hardware
requirements of this implementation with the requirements of a frame
buffer implementation. Specifically,

1.  Assume 512x512 resolution. This is adequate for many purposes.

2.  Assume 9 bits of intensity/color levels (say 8 levels of each of 3
colors).

3.  Assume the picture complexity is at most 2000 straight vectors (for
line drawings) or 2000 surface edges (in the case of surface
drawings). This number was obtained by counting lines on several
drawings of aircraft and spacecraft obtained from engineers
involved in vehicle analysis at NASA Langley Research Center.

4.  Assume the maximum number of vectors (surface edges) that
intersect any horizontal scan line is 500. This is 25 percent of
the entire picture. The drawings mentioned in (3) above had at
most 13 percent of their vectors on any one scan line.

5.  Assume a refresh rate of 30 frames/second.


In order to compare the cost of implementing this raster-scan
conversion method with the frame-buffer method, we must be able to
compare processors with some equivalent amount of memory. A quick
search through the microcomputer literature at the time of this writing
reveals that a fast microprogrammable processor in the class of the
INTEL 3000 series [5] or AMD 2900 series [6] can be obtained for
approximately the cost of 24K bytes of MOS memory. A processor of this
class with an appropriate word size will hereafter be called a "fast
microprocessor". Such processors are today available on a few LSI
circuits, are microprogrammable, can be implemented with any convenient
word size (bit sliced), and can execute 5 to 10 million
microinstructions per second. Their cost relative to memory cost may
or may not remain constant in the future. At a low level both
processors and memory bits may be regarded as some number of logic
gates to be fabricated onto one LSI integrated circuit. For this
reason one might expect the costs of processors and memory to be at
least somewhat correlated.


A.  INPUT  DATA  AND  Y-SORT  PROCESSOR 


The input data describe, in an encoded manner, the image to be
converted to raster-scan. This consists mainly of the endpoints of
vectors along with their intensity/color. To generate surface images,
these vectors are taken to represent the left edge of the surface, in a
manner described in section C. For this example, endpoints are given
as pairs of 9 bit integers of (x,y) screen coordinates. A vector from
(Xs,Ys) to (Xe,Ye) of intensity I is described by five 9-bit words
consisting of:

    Xs    Ys    Xe    Ye    I

Without loss of generality, assume Ys ≤ Ye. We assume the existence of
some computer capable of generating this list every 1/30 second if real
time motion is required of the entire image. All transformations
(rotation, scaling, etc.) are assumed to have been performed on this
data. Alphanumeric data and other graphics commands could easily be
accommodated, but are not relevant for this discussion.

The raster-scan conversion procedure described here consists of
several sort and merge operations on the image data. Referring to
FIGURE 1, we first sort the input data into ascending order of Ys,
using a Y-sort microprocessor. This produces the Y-sorted vector list.
Next, using a scan line processor, we produce a standard raster scan
video signal. Each of the pipelined processors communicates with the
next one by shared memory buffers. Double buffers are used so that the
Y-sort processor can be processing frame n+1 while the scan line
processor is processing frame n from the second buffer. For the Y-sort
processor, a bucket sort would be appropriate with 512 buckets, each of
variable size. This suggests a data structure consisting of a set of
512 linked lists, one list corresponding to the Y value of each scan
line. Also, while sorting, the Y-sort processor should replace Xe with
dx/dy = (Xs-Xe)/(Ys-Ye), calculated to 18 bits precision. A moment's
reflection will show that 18 bits are needed to specify the slope with
the same precision contained in the original data. To process 2000
vectors in 1/30 second requires a processor fast enough to process one
vector each 33 µs, on the average. This corresponds to about 200 to
300 instruction executions. A count of executed instructions in a
small program written for the INTEL 3000 series microprocessor shows
that this "fast microprocessor" can handle the sort, slope calculation,
and linked list manipulation in the required time. This microprogram
performed the division in about 150 instructions, leaving 50 to 150
instructions for the rest of the processing. The memory requirements
of this Y-sort are 24000 9-bit words. This provides for the two
buffers of the double buffering scheme described above, each capable of
holding the sorted data lists. Each list entry consists of five 9-bit
words of data and one 9-bit link pointer to the next entry in the list.
Ys need not be stored with each vector.
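The Y-sort step can be sketched as a bucket sort keyed on Ys. In this illustration Python lists stand in for the linked lists, the slope is held as a fixed-point integer with an assumed split of 9 integer and 9 fraction bits, and the handling of horizontal vectors (Ys = Ye), which the text does not specify, is an assumption marked in the code.

```python
FRAC_BITS = 9  # assumed split of the 18-bit slope: 9 integer + 9 fraction bits

def y_sort(vectors, scan_lines=512):
    """Bucket-sort (Xs, Ys, Xe, Ye, I) vectors by Ys, replacing Xe with dx/dy.

    Returns one list per scan line holding (Xs, slope, Ye, I) entries,
    where slope is dx/dy scaled by 2**FRAC_BITS.
    """
    buckets = [[] for _ in range(scan_lines)]
    for xs, ys, xe, ye, intensity in vectors:
        if ys > ye:                       # enforce Ys <= Ye by swapping endpoints
            xs, ys, xe, ye = xe, ye, xs, ys
        dx, dy = xe - xs, ye - ys
        if dy:
            slope = (dx << FRAC_BITS) // dy
        else:
            slope = dx << FRAC_BITS       # assumption: horizontal vector keeps whole dx
        buckets[ys].append((xs, slope, ye, intensity))
    return buckets
```

Because Ys is implied by the bucket index, it is not stored in the entries, matching the remark that Ys need not be stored with each vector.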


B.  SCANLINE PROCESSOR FOR LINE DRAWINGS


As each scan line is processed, an ACTIVE LIST of all vectors
intersected by the current scan line is maintained. New entries to the
active list are taken from the top of the sorted vector list produced
by the Y-sort processor. Vectors are deleted from the active list
while processing the last scan line which intersects them. Each entry
in the ACTIVE LIST consists of:

    Xc    dx/dy    Ye    I

where Xc is initially set to Xs. For line drawings using this method,
there is no need to sort the ACTIVE LIST. Hence the scan line
processor simply processes each entry in the previous scan line's
active list and then processes entries at the top of the sorted input
data list (if any) that are intersected by the current scan line. The
processing done to an entry from either of these two sources is the
same. It consists of:

1.  Calculate Xc' = dx/dy + Xc.

2.  Place intensity/color I in a 512 word scan line buffer at all
x-locations between Xc and Xc'.

3.  If this vector will be intersected by the next scan line (i.e.
Ye ≠ y-value of current scan line) place this vector into a
second ACTIVE LIST buffer, replacing Xc by Xc'. Otherwise drop
this vector from the ACTIVE LIST by not placing it into the second
buffer. This second buffer will be used as the primary buffer on
the next scan line.
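The three steps above can be sketched for a single scan line as follows. This is an illustrative model, not the hardware design: Xc and dx/dy are held as fixed-point integers with 9 assumed fraction bits, "between Xc and Xc'" is interpreted as inclusive of both integer positions, and writes are clipped to the buffer width.

```python
FRAC_BITS = 9  # assumed fixed-point fraction bits for Xc and dx/dy

def process_scan_line(y, active, new_entries, width=512):
    """Process one scan line of a line drawing.

    active, new_entries: iterables of (Xc, slope, Ye, I) in fixed point.
    Returns (scan line buffer, active list for the next scan line).
    """
    line = [0] * width
    next_active = []
    for xc, slope, ye, intensity in list(active) + list(new_entries):
        xc_next = xc + slope                                   # step 1
        lo, hi = sorted((xc >> FRAC_BITS, xc_next >> FRAC_BITS))
        for x in range(max(lo, 0), min(hi, width - 1) + 1):    # step 2
            line[x] = intensity
        if ye != y:                                            # step 3
            next_active.append((xc_next, slope, ye, intensity))
    return line, next_active
```

Feeding the returned `next_active` back in as `active` for scan line y+1 plays the role of swapping the two ACTIVE LIST buffers.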


The memory requirements for this process consist of two buffers
to double buffer the active list, each consisting of 2500 9-bit words.
Also, two 512 word scan line buffers for holding the intensities (the
result of this process) are needed.

The speed requirements of this processor are rather high. For a
maximum size ACTIVE LIST of 500 entries, the processor must process one
entry approximately every 130 ns. With today's "fast microprocessor"
speeds, this is an impossible situation. For realistic images, this
maximum size should seldom be reached, thus relaxing the speed
requirement and allowing the use of one or more "fast microprocessors"
if one is willing to design for a statistically average length active
list and to use several 512 word scan line buffers to feed the video
generator while processing scan lines with long active lists. This
would normally have no effect on real time motion of the images.
However, if all line buffers were empty when the video generator
requested the next line, the display could not continue at the normal
rate. An unmodified TV display will usually not operate at a reduced
rate. Some type of display with a variable scan line processing rate
would be well suited for this approach.
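The 130 ns figure follows directly from the assumed parameters, as the short calculation below shows. The function names are introduced here for illustration; the instruction rate of 10 million per second is the upper end of the range quoted for a "fast microprocessor".

```python
def ns_per_active_entry(frames_per_sec=30, scan_lines=512, max_active=500):
    """Time available per active-list entry on a worst-case scan line."""
    line_time_ns = 1e9 / (frames_per_sec * scan_lines)   # about 65,104 ns per line
    return line_time_ns / max_active

def instructions_per_entry(entry_time_ns, million_instr_per_sec=10):
    """Microinstructions that fit in the per-entry time at a given rate."""
    return entry_time_ns * million_instr_per_sec / 1000.0

budget_ns = ns_per_active_entry()              # roughly 130 ns
budget_instr = instructions_per_entry(budget_ns)  # between 1 and 2 instructions
```

With barely more than one microinstruction available per entry, a programmed processor cannot keep up in the worst case, which motivates the hardwired design discussed next.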


However, a relatively simple special purpose hardwired processor
can perform the required function for 500 vectors in real time for each
line. A design of such a unit is given in FIGURE 2. For comparison
purposes, we will assume two scan line buffers and the special-purpose
hardware processor of FIGURE 2. A more detailed design of this
processor has been done using the readily available 7400 series logic
family. It is implementable for about the hardware cost of
implementing a "fast microprocessor" CPU.

To actually drive a TV or other raster device, a hardware video
generator will also be needed which will accept one 512 word buffer of
intensity information for each scan line and generate the video signal.
A similar video generator is needed for the frame-buffer method also,
so its cost will not be considered in comparing the methods.
C.  SCANLINE AND ACTIVE-LIST PROCESSORS FOR SURFACES

Shaded surface images can be processed in much the same manner as
line drawings. In this case we assume that each vector in the ACTIVE
LIST represents the left side of a planar polygon. This surface is
assumed to extend to the right until reaching the next line to its
right in the ACTIVE LIST. This requires the ACTIVE LIST to be sorted
on Xc from left to right. Note that this representation of surfaces
does not contain a separate right side for polygon boundaries, so
overlapping surfaces cannot be represented. If the image data should
contain two overlapping surfaces, one or the other of the surfaces must
be displayed. If we do not display surfaces of less than one raster
unit in width (below the display resolution), then the integer part of
all Xc values for all vectors in the active list at any one time will
be different (except for overlapping surfaces with a common left side).
Thus the active list can be stored as a continuous vector in memory,
using 512 consecutive addresses with each address associated with some
integer value of Xc. New entries can be easily placed in the correct
position in the active list. Entries may change places each scan line
as Xc is updated.


The scan line processor for this scheme is similar to the processor
for line drawings. A block diagram is given in FIGURE 3. This
processor also merges new active list entries for scan line n+1 into
the active list while processing scan line n. These new entries are
easily placed directly into the proper position in the active list.
With 2000 vectors spread over 512 scan lines there will only be about 4
additions per scan line on the average. However, this number could
vary up to 500 additions for some unusual images.

A design  that  handles  either  lines  or  surfaces  is  only  slightly 
more  complex  than  either  FIGURE  2 or  3,  and  would  be  the  more 
reasonable  implementation  choice.  It  is  simply  the  union  of  the  main 
parts  of  both  FIGURES  2 and  3 with  a few  switches  located  at  points 
where  the  two  diagrams  differ. 

Since the active list data is sorted on Xc there is no need for the
512 word scan line buffers as before. Instead a single word register
contains the current beam intensity/color. The processing of the
active list can be synchronized to the current X-value of the raster
scan. As the raster sweeps across a scan line from left to right, X
changes by one unit each 132 ns. The current active list buffer is
scanned in synchronism at this rate. When encountering a non-empty
entry (i.e. the left side of some surface is encountered), data is
loaded from the active list into a register. The I portion of the data
in this register continually specifies the beam intensity/color. An
adder (which easily works in 132 ns) adds dx/dy to Xc, and the register
contents are placed in a new active list buffer at location
Xc + dx/dy (provided Ye ≠ current scan line y-value). This new active
list buffer will be the input buffer for the next scan line.
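The synchronized sweep described above can be modeled with an active list stored as an array indexed by the integer part of Xc. This is a sketch only: the entry layout (fraction of Xc, fixed-point slope, Ye, I), the 9 assumed fraction bits, and the policy of dropping entries that leave the screen are assumptions, and a later entry landing in an occupied slot simply overwrites it.

```python
FRAC_BITS = 9
FRAC_MASK = (1 << FRAC_BITS) - 1

def sweep_surface_line(y, active, width=512, background=0):
    """Sweep one scan line; active[x] is None or (frac, slope, Ye, I).

    Returns (pixel intensities for this line, active list for the next line).
    """
    pixels = []
    next_active = [None] * width
    beam = background                     # single-word beam intensity register
    for x in range(width):
        entry = active[x]
        if entry is not None:             # left side of some surface encountered
            frac, slope, ye, intensity = entry
            beam = intensity
            xc_next = (x << FRAC_BITS) + frac + slope   # adder: Xc + dx/dy
            slot = xc_next >> FRAC_BITS
            if ye != y and 0 <= slot < width:
                next_active[slot] = (xc_next & FRAC_MASK, slope, ye, intensity)
        pixels.append(beam)
    return pixels, next_active
```

The integer part of Xc is never stored in an entry; it is implied by the slot index, which is why the text notes that it need not be kept explicitly.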


The  memory  required  for  scan  line  processing  of  surfaces  is  two 
2048  9-bit  word  active  list  buffers  (we  need  not  explicitly  store  the 
integer  part  of  X for  each  active  list  entry).  No  512  word  line  buffer 
is  needed. 

The  cost  (complexity)  of  such  a special-purpose  processor  as 
estimated  from  a design  using  7400  series  logic,  is  approximately  the 
same  as  for  the  processor  described  in  section  B.  A comparable  video 
generator  is  also  needed  as  in  B. 

D.  POSSIBLE  MODIFICATIONS/ENHANCEMENTS  TO  THE  PROCESS 

If  the  scan  line  processor  was  implemented  in  software  or  used  a 
slower  inexpensive  microprocessor  that  was  not  able  to  execute  the 
algorithm  within  one  frame  time  on  some  complex  pictures,  a modified 
design  could  be  used  so  that  the  X position  on  each  scan  line  where  I 
changes  value  would  be  stored  in  an  encoded  manner  in  a buffer.  The 
length  of  this  buffer  is  proportional  to  the  picture  complexity,  and  is 
generally  much  smaller  than  a frame  buffer  memory.  It  could  be  used  to 
keep  the  display  refreshed  for  many  frames  by  a relatively  simple 
hardware  device  to  generate  the  video  signal.  This,  of  course, 
precludes  displaying  a different  image  on  each  frame. 
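One way to picture the intensity-change buffer is as a list of (X position, new intensity) pairs per scan line. The encoding below is a hypothetical illustration; the paper does not specify the actual encoded format, and the function names are introduced here.

```python
def encode_changes(line, background=0):
    """Record only the X positions where the intensity changes value."""
    changes = []
    current = background
    for x, value in enumerate(line):
        if value != current:
            changes.append((x, value))
            current = value
    return changes

def decode_changes(changes, width, background=0):
    """Regenerate the scan line from its change list, as a refresh device would."""
    line = []
    current = background
    pending = iter(changes)
    nxt = next(pending, None)
    for x in range(width):
        if nxt is not None and nxt[0] == x:
            current = nxt[1]
            nxt = next(pending, None)
        line.append(current)
    return line
```

The length of each change list depends on how many edges cross the scan line, not on the line width, which is why such a buffer is generally much smaller than a frame buffer.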

One advantage of partitioning the processing between the Y-sort and
the scan line processors in the manner described above is that several
hidden surface algorithms which produce raster scan output [7,8,9] use,
at an intermediate step, an ACTIVE LIST containing the data sorted in
the same order as the ACTIVE LIST described above. Thus it would be
possible to replace the scan line processor described above with a more
complex processor that would discover the overlapping surfaces from the
ACTIVE LIST and produce a display with hidden surfaces removed. The
Y-sort processor could still be used to produce the active list. In
this case the I value in the ACTIVE LIST would normally contain an
identification number for each surface. Also, an even number of
entries would appear in the ACTIVE LIST for each surface, corresponding
to the edges where the scan ray enters and exits the surface as it
moves left to right along one horizontal scan line. The intensity of
each different surface is supplied by indexing in an intensity table,
using the surface identification number as index. If such a processor
was not capable of processing the entire image in one frame time, the
intensity change buffer just described could be used to buffer the
image for several frames.

IV.  COMPARISON  OF  FRAME-BUFFER  AND  PROCESSOR  METHODS 

Table 1 compares the performance-limiting factors of the
frame-buffer and processor approaches. A frame buffer for 512x512
resolution requires 262,144 9-bit words of memory. The processor
approach described in this paper requires only 30,000 9-bit words of
memory (29,000 words for surfaces), or about 1/9 as much memory as the
frame buffer method.

The frame buffer approach requires some processing power to
generate the intensity patterns for vectors from their endpoint
descriptions. To meet the 2000 vector per frame specification, this
requires a processor capable of processing one vector and storing the
results in a frame buffer every 17 us. In the processor approach, the
scan line processor performs this function on the fly. The main
difference between the methods here is that the processor approach must do
this calculation in real time, whereas the frame buffer approach may do
it more slowly at the expense of real time motion.


The frame buffer approach has no component corresponding to the
Y-sort processor. Therefore, the equipment tradeoff between the two
methods is a Y-sort "fast microprocessor" and a scan line hardware
processor vs. 232,000 9-bit words of memory and enough host processing
power to generate the intensity patterns for vectors. Since the scan
line processor performs essentially the same algorithm, we may, to a
first approximation, equate the scan line processor to the cost of the
host processing power needed to generate the intensity frame buffer
patterns for individual vectors. An informal survey of the current
literature shows that 232,000 words of memory has a cost many times
that of a "fast microprocessor". We are considering only component
costs here, supposing that the fabrication costs for 232,000 words of memory
are approximately equal to the assembly costs of a "fast microprocessor".
This assumption is based on today's approximately equal integrated
circuit count for both the memory and processor described by FIGURE 3.
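
The memory and timing figures quoted in this section can be checked with a few lines of arithmetic. The sketch below is only a sanity check; the refresh rate of 30 frames per second is an assumption (it is not stated in this section), chosen because it reproduces the 17 us per vector figure.

```python
WIDTH = HEIGHT = 512
frame_buffer_words = WIDTH * HEIGHT                  # 9-bit words in a full frame buffer
processor_words = 30_000                             # figure quoted in the text
memory_saved = frame_buffer_words - processor_words  # words traded for the processors
memory_ratio = frame_buffer_words / processor_words  # "about 1/9 as much memory"

VECTORS_PER_FRAME = 2000
FRAMES_PER_SECOND = 30                               # assumed refresh rate
microseconds_per_vector = 1e6 / (VECTORS_PER_FRAME * FRAMES_PER_SECOND)
```

Under these assumptions the frame buffer holds 262,144 words, the difference is 232,144 words (the 232,000 used in the tradeoff), the ratio is roughly 8.7, and each vector must be handled in roughly 16.7 us.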

V.  SUMMARY 

The algorithm and suggested implementation using microprocessors are
not purported to be the best such algorithm or implementation.
However, they do show the capability of a microprocessor and a small
special purpose processor to perform the raster scan conversion
process. Thus the use of this technique appears feasible.
Economically, we conclude that unless the ratio of processor to memory
costs changes drastically from its current value, implementation
without a frame buffer appears to be preferred, based on today's
component costs, for systems with high resolution or real time motion.
Less readily comparable differences in the two raster conversion
processes are the maximum picture complexity limits imposed by
processor speeds vs. the maximum real-time motion picture complexity
imposed by frame buffer memory speed and host bit-map generating speed.
Also not readily comparable are the differences in software required by
the loss of a frame buffer and the addition of a vector list describing
the image. The frame buffer allows easy reading of the current image
at a given (x,y) point. Lieberman [10] notes that this makes it easy
to discover the edges of any enclosed region in the image, or to find
one's way out of a maze. On the other hand, the existence of a vector
list describing an image allows transformations to be performed more easily.
A frame buffer system by Carrett [11] even includes a vector list for
this purpose. Note that in the approach described in this paper,
real-time motion of the entire displayable image is automatic, provided
the host computer or yet another dedicated microprocessor can generate
the input data in real time. On the other hand, with a frame buffer,
any arbitrary memory intensity/color pattern can be displayed (flicker
free), even with moderately slow memory, so long as it does not all
move in real time.

Performance           Limited in frame-          Limited in processor
parameter affected    buffer approach by:        approach by:
--------------------  -------------------------  ------------------------
XY resolution         Size of frame buffer.      Speed of scanline
                                                 processor.

Complexity of still   No limit within            Speed of all processors.
picture               resolution.

Complexity of real-   Ability of host to         Ability of host to
time motion images    generate coordinates in    generate coordinates
                      real time.                 in real time.

                      Speed of host CPU to       Speed of scanline
                      interpolate between        processor.
                      vector endpoints.

                      Write speed of frame       Speed of Y-sort
                      buffer memory to accept    processor.
                      new image bit map.

TABLE 1.  FRAME BUFFER - PROCESSOR APPROACH COMPARISON

A GRAPHICALLY-PROGRAMMED, MICROPROCESSOR-BASED
INDUSTRIAL CONTROLLER

ABSTRACT


While  programmable  industrial  controllers  have  benefited  from  continuous 
technological  improvements  since  their  introduction  in  1969,  considerably  less 
effort  has  been  expended  on  the  equally  important  questions  of  how  to  program 
an  industrial  controller,  what  type  of  programming  language  to  use,  and  how 
the controller should interface with its human user. The advent of the
microprocessor has encouraged speculation that its use in an industrial controller
can  both  simplify  system  hardware  and  expand  system  flexibility  by  permitting 
sophisticated  system  software. 

This  paper  discusses  the  applicability  of  microprocessors  to  an 
industrial  environment  and  suggests  a new,  graphic  programming  language  based 
on  familiar  relay  symbols  as  the  system's  input.  A design  overview  is  presented 
for  two  stand-alone,  microprocessor-based  machines  — the  controller  itself,  which 
interprets  a user  program  and  manages  system  I/O,  and  an  auxiliary  program  loader 
which  supervises  interactive  graphic  programming  using  a special  keyboard  and 
CRT.  When  connected,  the  two  units  communicate  to  provide  the  user  with  the 
capability  of  monitoring  and  changing  a running  system. 

1.  INDUSTRIAL  PROCESS  CONTROLLERS 

The process control systems of the early 1960s were essentially
all-relay systems and suffered from a number of critical deficiencies:

1. Each new application meant a new, custom design;

2.  Each  new  design  implied  expensive  field  wiring; 

3.  Hardware  costs  were  high; 

4.  Relay  hardware  was  unreliable  without  extensive  field 
maintenance;  and 

5.  Systems,  once  installed,  offered  no  flexibility  for  changing 
or  upgrading  the  control  mechanism. 

The  development  of  the  microprocessor,  coupled  with  the  adaptation 
of  computer  science  software  technology,  is  bringing  forth  a new  generation 
of  industrial  controllers  which  substitutes  programming  for  field  wiring 
and  CRT  debug  monitors  for  continuity  testers. 

Programmable  controllers  are  useful  in  a host  of  industrial  applications. 
Some  typical  applications  include: 

conveying systems      sorting mechanisms      packaging operations
pumping stations       assembly operations     production testing
labeling               transfer lines          batch sequencing
machine tools          welding controls        baggage routing

While the major thrust here is the design of the system software for
such a controller (determination of an effective programming language and design
of two operating systems, a graphic translator, and a communications package
between dual processors), a successful implementation requires that due
consideration be given to the hardware environment in which the software is
to operate. By designing both hardware and software simultaneously, we have
attempted to achieve the proper balance between functions implemented in
hardware and those implemented in software.


2.  CONTROL  SEQUENCE  SPECIFICATION 


In the earlier industrial controllers the decision-making logic
functions were performed with either electro-mechanical relays or custom,
hard-wired, static logic devices. The documentation for the control function
of such systems was the electrical drawing defining how the relays and/or logic
devices were to be interconnected to form the control sequence. This "team"
of relays and control diagrams has been used successfully for many years.

The  control  diagram  which  defines  the  relationships  among  the  input,  output, 
and  control  variables  is  called  a "relay  ladder  diagram"  (RLD)  due  to  its 
characteristic  symbols  and  format  [1,2]. 

Described briefly, the RLD uses the leftmost vertical line to
represent a source of electrical power ("power line"); current propagates
through the matrix as relays enable or inhibit current flow along the horizontal
paths. If current does reach an output (indicated by a circle) in the
rightmost column, the output will turn on; otherwise it will turn off. See Figure 1,
a sample RLD, and Figure 2, a legend for some standard symbols.

3. PROGRAMMING WITH THE CRT


The  "program"  is  entered  via  pushbuttons  on  a custom  keyboard  and 
displayed  on  a CRT.  The  individual  "pages"  of  an  RLD  are  defined  to  be  a 
programming  matrix  of  four  rows  and  eight  columns.  Each  matrix  position  in 
the  first  seven  columns  may  contain  a space,  horizontal  connection,  normally 
open  relay,  or  normally  closed  relay.  A vertical  connection  to  the  line  below 
may  optionally  be  added  to  any  symbol  on  line  one,  two,  or  three.  The  relay 
element  addresses  may  refer  to  any  input,  output,  control  variable,  latch, 
timer  coil,  counter  coil,  or  any  other  internal  variable;  thus,  every  variable 
in  the  system  may  potentially  participate  in  the  control  of  any  output.  Column 
eight is reserved for outputs (normal or complemented, retentive or nonretentive,
timer/counter  coils,  etc.).  Any  page  may  contain  one  to  four  outputs,  one  per 
line.  There  are  no  formatting  restrictions  with  regard  to  the  interconnections 
on  a single  page. 

Once a matrix position has been selected with the cursor, any of the relay symbols
may be inserted. The I/O address of the element (the physical terminal number
to which that input or output is attached) is entered on a keypad containing
the digits 0 to 9. Each address digit entered shifts the current 3-digit
element address field (initially 000) one digit to the left and inserts the
just-depressed digit in the least-significant digit position. Depressing any
of the output symbols automatically advances the cursor to column eight, filling
intervening blank elements with horizontal connections, and inserts the output
symbol.
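
The shifting behavior of the 3-digit address field can be sketched in a few lines of Python. This is a minimal illustration, not the loader's actual firmware; the function name `enter_digit` is hypothetical.

```python
def enter_digit(address, digit):
    """Shift the 3-digit address field left one digit and append the new digit."""
    if not 0 <= digit <= 9:
        raise ValueError("keypad digits are 0 to 9")
    return (address * 10 + digit) % 1000


address = 0                      # the field is initially 000
for key in (1, 2, 3, 4):         # a fourth keypress pushes the leading 1 out
    address = enter_digit(address, key)
formatted = f"{address:03d}"
```

Because old digits simply fall off the left end, a mistyped address can be corrected by typing the full three digits again, with no separate clear key.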

If  an  error  is  made  while  programming  a given  page,  the  user  need 
only  reposition  the  cursor  to  the  matrix  position  in  error  and  press  the  symbol 
and/or  address  keys  corresponding  to  the  desired  correction.  In  this  manner 
the  user  may  program  any  matrix  position,  in  any  order,  as  well  as  edit  any 
number  of  positions  in  any  order.  Any  symbol  may  be  added,  deleted,  or  altered 
with  only  a few  keypresses.  After  the  page  has  been  visually  inspected,  depressing 
NEXT  PAGE  in  program  mode  saves  the  current  page  and  displays  a blank  next  page; 
in edit mode it replaces the saved page with the page just edited and then displays
the next page as recalled from memory. For ease of editing, accuracy of
documentation,  and  enhancement  of  visual  verification  of  correctness,  the 
translation  technique  preserves  the  exact  topology  of  the  input  RLD  so  that  it 
can  be  reconstructed  exactly  from  the  internal  code  at  any  later  time. 

When  programming  a timer  or  counter,  the  user  may  choose  to  insert 
his  4-digit  presets/setpoints  directly  through  the  keypad.  All  presets/setpoints 
may  be  changed  dynamically  at  any  time  during  program  execution  for  system  tuning 
or  correction. 

When all the pages of a program have been completed, the COMPILE
pushbutton is depressed. This invokes the graphic translator software package,
which translates the various graphic pages into internal code.

The  utility  of  the  programming  format  is  illustrated  by  the 
simplicity  of  its  rules: 

1.  There  are  four  rows  and  eight  columns  per  RLD  page; 

2.  Column  eight  is  reserved  for  outputs; 

3.  Verticals  may  not  extend  below  line  four;  and 

4.  Lines  which  cross  are  assumed  to  connect. 

The  user  should  appreciate  the  fact  that  it  is  impossible  to  introduce 
a permanent  syntax  error.  Not  only  is  the  language  designed  for  freedom  of 
expression,  but  the  only  possible  syntax  errors  are  detected  by  the  graphic 
translator: 

1.  Nonoutput  type  symbol  in  column  eight; 

2.  Vertical  entered  on  line  four;  and 

3.  Invalid  I/O  address  for  relay  element. 

In  each  case  the  invalid  keypress  is  immediately  rejected  and  an  audible  beep 
(error  signal)  emitted. 
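
The three checks above might be expressed as follows. This is a sketch under stated assumptions: the symbol names, the function `keypress_is_valid`, and the valid address range of 0 to 255 are all hypothetical, since the text says only that addresses are entered as three digits.

```python
OUTPUT_SYMBOLS = {"output", "complemented_output", "timer_coil", "counter_coil"}  # hypothetical names
VALID_ADDRESSES = range(0, 256)   # assumed range, not given in the text


def keypress_is_valid(row, column, symbol, address=None, vertical=False):
    """Return True to accept a keypress, False to reject it with a beep.

    Rows are numbered 1 to 4 and columns 1 to 8, as on an RLD page.
    """
    if column == 8 and symbol not in OUTPUT_SYMBOLS:
        return False              # nonoutput type symbol in column eight
    if vertical and row == 4:
        return False              # vertical entered on line four
    if address is not None and address not in VALID_ADDRESSES:
        return False              # invalid I/O address for relay element
    return True
```

Rejecting the keypress at entry time, rather than reporting errors after compilation, is what makes a permanent syntax error impossible.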

The switch from program to edit mode is painless and immediate.
Since the system decompiles compiled code exactly, the user experiences no
disturbing format change when he edits a program. To further aid his editing,
a SEARCH function is provided in which a relay symbol and address are given.
Each sequential depression of the SEARCH key locates and displays the next page
of RLD code in which the given symbol is used in conjunction with the given
address.

Other modes, including MONITOR, FORCE, and MEMORY COPY, allow the
user to interactively debug a running program by watching relays conduct and
outputs close in real time, or by forcing any input on or off. Once a complete
RLD has been entered, it is compiled and stored in a 1Kx8 RAM within the program
loader. After further execution, testing, and possible alteration in RAM, the
program may be permanently stored by inserting a 1Kx8 PROM and depressing
a "copy RAM" key; the PROM will be automatically programmed to duplicate the
content of the RAM. Likewise, a previously compiled program stored in PROM
may be transferred to RAM by depressing a "copy PROM" key; the program will
be decompiled and made ready for display, monitoring, and/or editing. Programs
in RAM and PROM may be verified to be identical by use of the VERIFY key; a
CRT message reports the address and content of any mismatch.
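
The VERIFY comparison amounts to a byte-by-byte scan of the two 1Kx8 memories. The following Python sketch shows the idea; the function name and the tuple format of the report are assumptions for illustration.

```python
def verify(ram, prom):
    """Return (address, ram_byte, prom_byte) for every location that differs."""
    if len(ram) != len(prom):
        raise ValueError("memories must be the same size")
    return [(addr, r, p) for addr, (r, p) in enumerate(zip(ram, prom)) if r != p]


ram = bytearray(1024)            # 1Kx8 RAM image
prom = bytes(ram)                # PROM programmed from the RAM
ram[0x2A] = 0x07                 # a later alteration made in RAM only
mismatches = verify(ram, prom)
```

An empty list means the two programs are identical; otherwise each entry carries what the CRT message would need to report.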

In summary, the graphic programming "language" described provides
numerous benefits over any other programming system currently available. Its
use of standard relay symbols, keypress programming, easy program entry and
editing, and reliance on visual verification of correctness allow the user
to capitalize on his intuition; drawing his RLD on the CRT is no more trouble
than drawing it on paper.

From the user's point of view, the most important advantages accrue
from the unrestricted format of the RLD input (unique to the industry), the
ability to recreate the exact RLD used as input from the internal code alone
(also unique), and the freedom to define up to four outputs per relay page
(likewise unique).

4. LANGUAGE TRANSLATION

Having  defined  an  acceptable  graphic  programming  language,  a 
significant  problem  still  remains--how  to  translate  the  RLD  into  an  internal 
code  in  such  a way  as  to  extract  its  Boolean  content  while  preserving  its 
topology.  Historically,  the  imposition  of  formatting  restrictions  served 
the  purpose  of  reducing  the  complexity  of  the  translation  process.  But  now 
we  search  for  a suitable  translation  scheme  which  will  neither  impose  format 
restrictions  nor  be  so  complicated  as  to  strain  the  space  and  speed  limitations 
of  a microprocessor-based  system. 

Into  what  should  the  RLD  be  translated?  The  two  basic  alternatives 
are  to  produce  executable  code  for  the  chosen  microprocessor  or  to  produce 
an  intermediate  text  which  can  be  interpreted.  Although  the  former  offers  the 
advantage  of  high-speed  execution,  the  latter  scheme  will  conserve  memory  space 
as  well  as  simplify  conversion  to  a different  microprocessor  in  some  future 
version  of  the  hardware. 

The  compilation  of  a graphic  RLD  into  directly  executable  microprocessor 
code  may  be  accomplished  by  Boolean  equation  templates  or  recursive  maze 
traversal,  producing  either  true  Boolean  equations  or  their  parse  tree,  as 
desired  [3,4,5].  The  direct  advantage  of  either  technique  is  that  both  the 
compilation  and  optimization  techniques  are  well  known  [6,7,8]  and  could  be 
expected  to  produce  high  quality  microprocessor  code.  On  the  other  hand,  the 
number  of  terms  in  the  equation  and  the  complexity  of  the  optimization  pass 
would  require  a large  amount  of  microprocessor  memory  for  both  the  translation 
program  (in  ROM)  and  the  working  store  (in  RAM).  The  time  of  translation  would 
be extended by the optimization pass, although this is a minor consideration.
Most importantly, the optimization of the microprocessor code would prohibit
using the code sequence itself to reconstruct the original RLD.

Benchmark programs containing approximately 300 relay symbols have
been  compiled  into  directly  executable  code  for  the  most  popular  microprocessors 
(Intel  8080,  Motorola  6800,  MOS  Technology  6502,  Zilog  Z-80)  and,  without 
significant  exception,  produce  approximately  3000  bytes  of  code  with  an 
execution  time  of  about  4 ms.  While  the  execution  time  is  well  within  the 
desired  range,  the  memory  requirement  is  not.  The  information  density  (memory 
bytes  required  per  relay  symbol)  is  low  due  to  the  bit  masking  and  shifting 
required  by  the  microprocessor  architecture  and  the  overhead  of  keeping  two 
copies of the output and control variables (necessary to guarantee equation
independence).
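
One reading of "equation independence" is that every equation in a scan must see the same snapshot of the variables, so that the result does not depend on the order in which the equations are evaluated. The sketch below illustrates that reading with two copies of the variable table; the names `scan`, `X1`, `Y1`, and `Y2` are hypothetical.

```python
def scan(state, equations):
    """Evaluate every equation against the same snapshot of the variables."""
    next_state = dict(state)                 # the second copy receives the results
    for name, equation in equations.items():
        next_state[name] = equation(state)   # always read the old copy
    return next_state


equations = {
    "Y1": lambda s: s["X1"] and not s["Y2"],
    "Y2": lambda s: s["Y1"],
}
state = {"X1": True, "Y1": False, "Y2": False}
after_one = scan(state, equations)
after_two = scan(after_one, equations)
```

In the first scan Y1 turns on, but Y2 still sees the old value of Y1 and stays off until the second scan. With a single copy, the outcome would depend on which equation happened to run first.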

The excess memory requirement is not the only problem, however.
Since the code embodies the logic, but not the topology, of the RLD input,
the original geometry cannot be recovered from the microprocessor code alone.
Additionally, execution-time optimization, such as recognizing that the
evaluation of a lengthy expression is unnecessary if the result is to be
ANDed with a zero, is possible only at the expense of even more memory.
Further, should the system be programmed by hand, rather than by the graphic
translator, using microprocessor code as the source language would require
the user to be very familiar with a specific microprocessor architecture
and its assembly language.

In  summary,  the  main  advantage  of  directly  executable  microprocessor 
code  is  its  speed  of  execution;  however,  it  does  not  facilitate  recovery  of  the 
input  RLD,  execution-time  optimization,  or  hand  programming. 

An  interpretive  code  can  be  used  to  increase  the  information  density 
as  well  as  to  encode  the  geometry  of  the  input  RLD  at  the  expense  of  a larger 
software  system  (an  interpreter)  and  additional  execution  time.  Still,  such 
a tradeoff  would  be  economically  favorable  if  it  reduced  the  investment  in 
PROM chips required for the translated RLD program by more than it increased
the investment in ROM chips required for the whole operating system (assuming
execution time remained satisfactory). If the interpretive code is
column-oriented and evaluated left-to-right, software can detect rather easily when
current  propagation  has  ceased  and  can  jump  directly  to  the  output  column  and 
begin  storing  outputs.  This  ability  to  dynamically  skip  the  evaluation  of 
"non-participating"  relay  columns  recovers  some  of  the  time  lost  to  interpretation. 

To define an adequate column-oriented code, note that each of the
seven relay columns on a page has four lines, each of which either holds a
relay or does not, so there can be only 16 positional combinations of
relays in any one column; therefore, we define 16 distinct "relay column
patterns." The pattern number (0 to 15) identifies the geometry of the
column (pattern # 0 indicates no relays, pattern # 1 indicates one relay on
line 4, . . ., pattern # 15 indicates one relay on each of lines one, two,
three, and four). Additional memory bytes identify the particular I/O address
of each relay element in the column. One additional byte indicates which, if
any, of the relay elements are complemented (normally closed) and which, if any,
of the three possible vertical connections between lines are present. An example
follows.


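[Figure: example relay column and its code. The column diagram shows
left-hand nodes N1 to N4, right-hand nodes N1' to N4', and vertical
connections V1 to V3.]

  CODE              MEANING
  PATTERN #7        column type and pattern code
  4 0 0 3 1 2       I/O addresses
  0010 1010         "complement/vertical" byte: verticals on lines 1
                    and 3, complement on line 3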
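
The column encoding can be sketched as follows. The bit assignments are assumptions read off the example: line 1 is taken as the most significant bit of each 4-bit field (consistent with pattern # 1 meaning a relay on line 4), the complement bits are taken as the high half of the "complement/vertical" byte, and the three vertical bits are followed by one unused bit. The function name `encode_column` is hypothetical.

```python
def encode_column(relay_lines, complemented_lines, vertical_lines):
    """Return (pattern_number, complement_vertical_byte) for one relay column.

    Lines are numbered 1 to 4 from the top; a vertical on line k joins
    line k to line k + 1.
    """
    pattern = 0
    for line in relay_lines:
        pattern |= 1 << (4 - line)            # line 1 -> 8, line 4 -> 1
    complement = 0
    for line in complemented_lines:
        complement |= 1 << (4 - line)
    vertical = 0
    for line in vertical_lines:
        if line not in (1, 2, 3):
            raise ValueError("verticals may not extend below line four")
        vertical |= 1 << (4 - line)           # three bits used, lowest bit unused
    return pattern, (complement << 4) | vertical


# The example column: relays on lines 2, 3, and 4, complement on line 3,
# verticals on lines 1 and 3.
pattern, flags = encode_column({2, 3, 4}, {3}, {1, 3})
```

Under these assumptions the example yields pattern 7 and the byte 0010 1010, matching the figure. Because the pattern, addresses, and flags together record where every relay and vertical sits, the column can be redrawn from the code alone.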

"Executing" the column now reduces to fetching the dynamic
status of the relay elements of the column, $R_{i,j}$, $i = 1,2,3,4$, ANDing this
4-bit vector with the left-hand nodes $N_{i,j-1}$, $i = 1,2,3,4$, and creating
a new set of right-hand nodes $N_{i,j}$, $i = 1,2,3,4$, after accounting for
the influence of complemented variables and additional propagation due to
vertical connections. Since the "solution" of the column conduction problem
is a function of fifteen variables (four relay states, four left-hand nodes,
four complement bits, and three vertical connections; $i = 1,2,3,4$, $k = 1,2,3$),
it is speedily solved by table lookup.
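
The column computation can be written out directly, as in the Python sketch below. It computes the right-hand nodes rather than looking them up in a table, and it rests on two assumptions not stated in the text: a line with no relay in the column is treated as a plain wire, and vertical connections join the right-hand sides of adjacent lines. The function names are hypothetical.

```python
def propagate_verticals(nodes, verticals):
    """Spread current across lines joined by vertical connections.

    nodes: four booleans for lines 1 to 4.
    verticals: three booleans; verticals[k] joins line k+1 to line k+2.
    """
    out = list(nodes)
    start = 0
    for k in range(4):
        if k == 3 or not verticals[k]:          # line k+1 closes a group of joined lines
            level = any(nodes[start:k + 1])
            for i in range(start, k + 1):
                out[i] = level
            start = k + 1
    return out


def execute_column(left, relays, energized, complemented, verticals):
    """Return the right-hand nodes of one relay column (lists of four booleans)."""
    right = []
    for i in range(4):
        if relays[i]:
            conducts = energized[i] != complemented[i]   # a normally closed relay inverts
        else:
            conducts = True                              # assumption: no relay acts as a wire
        right.append(left[i] and conducts)
    return propagate_verticals(right, verticals)


# Pattern #7 column: relays on lines 2, 3, 4; line 3 complemented;
# verticals on lines 1 and 3. All relays are de-energized here.
right_nodes = execute_column(
    left=[True, True, True, True],
    relays=[False, True, True, True],
    energized=[False, False, False, False],
    complemented=[False, False, True, False],
    verticals=[True, False, True],
)
```

Here lines 1 and 3 conduct on their own, and the verticals carry that current to lines 2 and 4, so all four right-hand nodes are energized. A table-driven version would precompute these results for every input combination.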

Likewise,  the  output  column  (column  eight)  could  have  been  segmented 
into  different  patterns  for  each  different  type  of  output  column  possible,  but 
the  large  number  of  possibilities  makes  this  approach  impractical.  Rather,  the 
output  column  code  identifies  the  code  segment  as  an  output  column,  identifies 
the  lines  from  which  an  output  is  stored,  indicates  if  it  is  complemented  and 
to  what  output  address  the  computed  result  is  to  be  stored  (see  below). 


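[Figure: example output column and its code. The column diagram shows
nodes N1 to N4, with an output at address 22 on line 1 and an output at
address 23 on line 3.]

  Code fields, in the order they appear: OUTPUT CODE; 1000; 1010;
  1 1 1 1; output addresses 2 2 and 2 3.

  Labels: column type code; "complement/pattern" byte; store from
  N1 and N3; output addresses.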


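The code segments for each page are assembled in column order,
left to right, and surrounded by a page marker and a pointer to the output
column code as shown below.

    +-----------------------+
    | PAGE MARKER           |
    +-----------------------+
    | OUTPUT COLUMN POINTER |---+
    +-----------------------+   |
    | COLUMN 1              |   |
    +-----------------------+   |
    |   .                   |   |
    |   .                   |   |
    +-----------------------+   |
    | COLUMN j              |   |
    +-----------------------+   |
    |   .                   |   |
    |   .                   |   |
    +-----------------------+   |
    | OUTPUT COLUMN         |<--+
    +-----------------------+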
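
The page layout can be sketched as a simple byte string. The marker value, the one-byte pointer measured from the start of the page, and the sample column bytes are all hypothetical; the text does not give the actual field widths or values.

```python
PAGE_MARKER = 0xFF   # hypothetical marker value


def assemble_page(relay_columns, output_column):
    """Return the code for one page: marker, pointer, relay columns, output column."""
    body = b"".join(relay_columns)
    pointer = 2 + len(body)          # offset of the output column from the page start
    if pointer > 0xFF:
        raise ValueError("page too long for a one-byte pointer")
    return bytes([PAGE_MARKER, pointer]) + body + output_column


def output_column_of(page):
    """Follow the pointer to the output column without decoding the relay columns."""
    if page[0] != PAGE_MARKER:
        raise ValueError("not at a page boundary")
    return page[page[1]:]


# Placeholder bytes standing in for two relay columns and one output column.
relay_columns = [bytes([0x07, 0x40, 0x03, 0x12, 0x2A]), bytes([0x00, 0x00])]
output_column = bytes([0xF0, 0x8A, 0x22, 0x23])
page = assemble_page(relay_columns, output_column)
```

Storing the pointer ahead of the relay columns means an interpreter can reach the output column in one step, without having to parse the variable-length columns in between.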



Whenever execution detects that a newly calculated set of nodes
$N_{i,j}$, $i = 1,2,3,4$, are all zero, the internal program counter is advanced to the
output column code, thereby omitting both interpretation and execution of the
intervening column code. Finally, the code for all pages is assembled along
with prelude and postlude blocks which simplify system startup and optimization.
The advantages of the scheme are numerous, including:

1. Simplicity--simple enough to hand code if necessary;

2. Fast--worst case (no optimization branches) for the 300 symbol
benchmark program is 25 ms;

3. Economical--a 1Kx8 program PROM (1 chip) will hold a 600 symbol
RLD;

4. Input recovery--the exact topology of the input RLD is wholly
specified within and recoverable from the internal code; and

5. Moderate software complexity--the operating system for the
controller (including the interpreter) is 2K bytes; the
operating system, interface, and graphic translator/editor/
monitor for the program loader is 8K bytes.
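
The figures quoted for the two translation schemes can be compared with a short calculation. This is only arithmetic on the numbers given in the text; the interpreted density is an upper bound inferred from the stated PROM capacity, not a measured value.

```python
direct_bytes_per_symbol = 3000 / 300          # compiled microprocessor code
interpreted_bytes_per_symbol = 1024 / 600     # 1Kx8 PROM holding a 600 symbol RLD
density_gain = direct_bytes_per_symbol / interpreted_bytes_per_symbol
slowdown = 25 / 4                             # worst-case interpreted ms vs. compiled ms
```

On these numbers the interpretive code needs about 1.7 bytes per relay symbol against 10 for compiled code, a gain of nearly six in density, at the price of a worst-case execution time about six times longer.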


5.  THE  TOTAL  SYSTEM 

In  addition  to  graphic  compilation,  the  program  loader  provides  complete 
programming  and  editing  services  in  a stand-alone,  interactive  environment. 
Additional  modes  of  operation  allow  the  loader  to  be  used  as  an  in-the-field 
monitor.  Putting  the  loader  in  MONITOR  mode  allows  the  user  to  gain  access  to 
a running  program  in  a controller,  copy  it,  decompile  it  into  the  exact  original 
geometry  and  watch  relay  contact  closures  displayed  in  real  time  on  the  CRT 
screen.  A FORCE  mode  permits  the  user  to  isolate  any  number  of  inputs  and 
outputs  and  force  them  on  or  off  to  test  both  programming  logic  and  field 
connections. 

Communication  between  the  program  loader  and  controller  is  accomplished 
by  an  interrupt-driven  scheme  in  which  the  loader  can  request  status  information 
from  the  controller  and/or  override  controller  calculations.  The  two  independent 
units communicate through PIAs connected via a single cable. The modular
design ensures that one program loader may service all of the controllers in
any one application, thus reducing total system cost.


Figures  3 and  4 show  the  physical  keyboard,  CRT,  and  mode  selection 
layout  of  the  program  loader.  Figure  5 shows  the  hardware  required  by  the 
communications  package. 

6. IMPACT ON MICROPROCESSOR SYSTEM DESIGN

This  work  refutes  the  contention  that  microprocessors  are  too 
impotent  to  benefit  from  elaborate  software  systems  (compilers,  interpreters, 
operating  systems,  etc.).  Strict  adherence  to  the  principles  of  structure, 
isolation,  and  modularity  slowed  the  initial  design  and  implementation 
process,  but  paid  off  handsomely  in  the  accuracy  and  reliability  of  the 
final  product.  The  development  cycle  for  hardware  and  software  required 
approximately  two  man-years  each.  The  most  vexing  problem  area  was  the 
implementation  of  the  elaborate  asynchronous  communications  protocol  between 
the two independent processors. It is interesting to note that a subsequent
controller and programming unit with four times the processing power of
this system was designed and built from scratch in eight man-months each
for hardware and software; the communications scheme was not a problem
here due to prior experience. The clear implication is that, with a little
experience, progressively more elaborate software systems will be permitted
in a previously hardware-dominated domain.

One  should  not  infer,  however,  that  current  microprocessor  architecture 
is  at  its  zenith.  Rather,  the  expanding  use  of  microprocessors  is  sure  to 
have  a significant  impact  on  future  microprocessor  design.  While  currently 
available  microprocessors  tend  to  follow  the  conventional  architectures  of 
previous mini- and maxi-computers, increased usage in special purpose
applications will justify the design of new, special purpose microarchitectures.
Although the solution of relay ladder diagrams is only one instance of a
special purpose application, it nevertheless serves to illustrate the limitations
of existing microprocessors and to suggest improvements in future products.

In  the  area  of  industrial  process  control,  five  observations  can  be  made. 


Processor  speed  is  marginal. 

Instruction  times  of  2-5  microseconds  generate  unacceptable  delays 
when  attempting  replacement  of  conventional  hardware  logic  elements  with 
their  software  equivalent.  As  software  systems  become  larger  and  more  complex, 
faster  processors  (perhaps  of  bipolar  technology)  will  be  mandatory.  Until 
then, discrete outboard logic (e.g., PLAs, multiplier units) will be used
to  augment  standard  microprocessor  capabilities. 


Eight-bit  data  paths  are  too  restrictive. 

Process  control  deals  with  Boolean  control  (1  bit),  BCD  digits 
(4  bits),  ASCII  characters  and  status  vectors  (8  bits),  binary  counters  and 
internal addresses (16 bits), and data transfer (typically 1K bits). Few
microprocessor  instruction  sets  are  as  effective  with  bits  as  they  are 
with  bytes;  as  a result,  excessive  bit-masking,  shifting,  and  packing  reduce 
efficiency.  An  instruction  set  which  effectively  handles  variable  length 
data  (i.e.,  bit,  byte,  word,  and  block)  would  be  a distinct  advantage. 


Instruction  sets  must  provide  multiple  addressing  modes. 

Intel  chose  to  feature  a diversity  of  instruction  types;  Motorola 
chose  to  implement  fewer  instructions  with  more  addressing  modes.  The  latter 
is  preferable.  The  system  software  of  the  controller  alone  used  every 
addressing  mode  in  the  6502  and  could  have  benefitted  from  even  more  elaborate 
ones  (double  indirect,  double  indexed). 

We need more elaborate software support.


Manufacturer-supplied  development  software  varies  widely  in  its 
quality.  Assemblers  are  well  understood  and  are  generally  adequate; 
simulators  are  not;  compilers  are  a joke.  The  productivity  and  efficiency 
of  microprocessor  system  designers  are  proportional  to  the  quality  of 
software support systems used. While PL/M-type compilers are useful for
initial design evaluation and debugging, their code is too inefficient for
use in a production environment.

Microprogrammable  architectures  will  blossom. 

The  flexibility  of  defining  system  architecture  via  microcode  is 
extremely  advantageous.  Microprogramming  allows  the  designer  to  generate 
exactly  those  instructions  and  addressing  modes  deemed  most  useful  for  the 
application.  The  cost,  however,  in  design  time,  prototyping,  and  system 
debug  time  (both  hardware  and  software)  is  enormous.  This  situation  is 
due  to  a general  industry  unfamiliarity  with  microprogramming  concepts 
and,  again,  inadequate  hardware  and  software  tools  for  generating  and 
testing  microprograms.  Once  support  is  established,  microprogrammed 
processors  will  be  the  way  of  the  future. 

Figure 2. Some RLD Symbols

Figure 4. CRT and Mode Selection Keyboard


Figure  5.  Communications  Hardware 

ABSTRACT

A Microprogrammable  Integrated  Data  Acquisition  System 
(MIDAS)  was  developed  as  a compact,  lightweight,  and  economical 
system  for  the  acquisition  of  medium-speed  analog  signals. 

The system contains a microcomputer, an incremental digital
cassette  tape  recorder,  a high-speed  16-channel  multiplexed 
analog-to-digital  converter,  a digital  clock,  and  a keyboard 
for  program  or  parameter  entry.  The  details  of  operation 
and  programming  are  described. 


INTRODUCTION 


The research and development effort whose product is
described in this work was undertaken in an attempt to
produce an alternative to the large, cumbersome, and expensive
analog tape recorder. A Microprogrammable Integrated
Data Acquisition System (MIDAS) was the result.

Quite often, the analog tape recorder provides frequency
response and data volume not required for a specific investigation.
This implies that an optimum equipment suite is
not always the most elegant. On the contrary, the optimal
suite carefully mates enough samples for the resolution
required. With the analog tape recorder, data are presented
as a continuum, and must be discretized.

An initial attempt, using discrete logic devices,
although feasible, had limited utility because of its hard-wired
configuration. It did provide excellent speed and
reliability in taking data samples at a fixed rate from a
fixed number of analog devices and recording digitized
values on magnetic tape. Although this was a substitute
for the analog recorder, it did not provide the flexibility
which was desired, since it ran like the analog machine at
a limited number of speeds, and was constrained by hard-wiring
to do only one job.

In order to provide a large degree of flexibility, a
microcomputer was used to replace the hard-wired discrete
logic devices. The microcomputer gave MIDAS the same
capabilities which a maxi- or mini-computer would give, with
a tolerable penalty in speed and memory capacity. Programs
could be written in high-level or assembly language to replace
the functions once performed by discrete logic devices.
Thus, MIDAS was capable of sampling a variable number of
channels a variable number of times (variable among the
channels), upon some variable external command or sequence
of events.

The system combined a high-speed analog-to-digital converter,
an incremental digital tape recorder, and a second-generation
microcomputer in a compact, lightweight package.
Requiring only external electrical power, the system was
capable of operation in an infinite number of modes. The
completed product cost less than three thousand dollars and
has been used to replace analog tape recorders costing five
times that much in cases where the data resolution could be
handled by MIDAS.


II. SYSTEM DESCRIPTION
A.  PRIMARY  COMPONENTS 

MIDAS used three commercially available units from original
equipment manufacturers, each of which made a large contribution
to the entire MIDAS system. The three units were: a
DATEL, Inc., model DAS-16 multiplexed modular data acquisition
system; a MEMODYNE, Inc., model BR-103 incremental digital
cassette tape recorder; and the heart of the system, a PRO-LOG,
Inc., model MPS-805 microcomputer.

1. Modular Data Acquisition System

The DATEL DAS-16 incorporated a series pair of eight-channel
analog multiplexers, a sample-hold unit, a high-speed
analog-to-digital converter (ADC), and a system programmer
module, in a 4.5 inch by 5.0 inch module only 1.5 inches
thick. The particular series of DAS-16 used in MIDAS was
capable of 25 kHz throughput, and had a resolution capability
of 4096 counts over a +5 volt range.
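
As a rough worked example of the figures quoted above, the sketch below computes the voltage step per count and the per-channel sampling rate when all 16 channels share the 25 kHz throughput. The 5-volt total span, the ideal (perfectly linear) converter, and the function names are assumptions made for illustration; they are not taken from the DAS-16 documentation.

```python
# Illustrative arithmetic only. Assumed: a 5 V total input span and an
# ideal converter with 4096 equal steps. Names are hypothetical.
COUNTS = 4096            # resolution quoted for the DAS-16
SPAN_VOLTS = 5.0         # assumed total input span
THROUGHPUT_HZ = 25_000   # aggregate conversion rate quoted for the DAS-16


def lsb_volts(span=SPAN_VOLTS, counts=COUNTS):
    """Voltage represented by one count."""
    return span / counts


def counts_to_volts(code, span=SPAN_VOLTS, counts=COUNTS):
    """Convert a raw converter code (0..counts-1) to volts above the low end of the span."""
    if not 0 <= code < counts:
        raise ValueError("code out of range")
    return code * span / counts


def per_channel_rate(channels, throughput=THROUGHPUT_HZ):
    """Samples per second per channel when the channels are scanned in turn."""
    if channels < 1:
        raise ValueError("at least one channel is required")
    return throughput / channels
```

Under these assumptions one count is about 1.22 millivolts, and scanning all 16 channels leaves roughly 1560 samples per second for each channel.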

The  DAS-16  could  be  operated  in  either  random  address 
or  sequential  address  mode.  In  the  random  address  mode,  the 
DAS-16  was  forced  to  the  addressed  channel,  and  had  to  remain 
on  that  channel  until  the  address  command  was  changed.  In 
the sequential mode, the system programmer module was pulsed
with a CONVERT* command, and it in turn advanced the multiplexer
address to the next sequential channel when the conversion
was complete. The RESET* command was used to drive
the system programmer to output the address for channel 1 to
the multiplexer, and operated in either address mode. Although
provision was made for wiring the DAS-16 to reset upon reaching
a pre-defined upper channel address, the capability was
not used in MIDAS.
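
The two addressing modes can be pictured with a small software model. This is only a sketch of the behavior described above: the class and method names are hypothetical, and the wrap from channel 16 back to channel 1 is an assumption, since the text says only that the wired upper-channel reset was not used.

```python
class MuxAddressModel:
    """Toy model of the random and sequential address modes (not DAS-16 firmware)."""

    def __init__(self, channels=16):
        self.channels = channels
        self.address = 1  # channels are numbered from 1, as in the text

    def reset(self):
        """RESET*: return the multiplexer to channel 1 (valid in either mode)."""
        self.address = 1

    def select(self, channel):
        """Random address mode: force the multiplexer to one channel and stay there."""
        if not 1 <= channel <= self.channels:
            raise ValueError("channel out of range")
        self.address = channel

    def convert(self):
        """Sequential mode: convert the current channel, then advance to the next."""
        converted = self.address
        # Assumed wrap-around after the last channel.
        self.address = self.address % self.channels + 1
        return converted
```

Calling convert() repeatedly after reset() visits channels 1, 2, 3, and so on, while select() pins the model to a single channel until the address is changed.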




2. Incremental Tape Recorder


The MEMODYNE model BR-103 tape recorder was configured
to use the industry standard "PHILLIPS" digital tape cassette,
a certified computer-grade tape cassette. In the MIDAS system,
this cassette was capable of storing over 1.7 megabits
of information. Data were presented to the recorder in bit-serial
format, with both MOTOR CLOCK and WRITE CLOCK waveforms
generated by the microcomputer in the system. The recorder
used a two-track recording system such that a flux change
on one track signified a logical "1", while a flux change on
the other track signified a logical "0". MEMODYNE provided
both a "sync" output and input for signaling end of byte,
although the sync system was not used in MIDAS. There were
four circuit cards used in the tape recorder, three of which
were purchased from the manufacturer, and one of which was
manufactured locally.
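
The two-track scheme can be sketched in a few lines: each bit becomes one flux change, placed on whichever track stands for that bit's value. The track labels, the most-significant-bit-first ordering, and the function names below are assumptions for illustration; the text does not state the bit order used by MIDAS.

```python
ONE_TRACK = "one"    # hypothetical label for the track whose flux change means "1"
ZERO_TRACK = "zero"  # hypothetical label for the track whose flux change means "0"


def encode_bits(data):
    """Turn bytes into a list of track labels, one flux change per bit (MSB first, assumed)."""
    events = []
    for byte in data:
        for shift in range(7, -1, -1):
            events.append(ONE_TRACK if (byte >> shift) & 1 else ZERO_TRACK)
    return events


def decode_bits(events):
    """Rebuild bytes from a list of track labels produced by encode_bits."""
    if len(events) % 8:
        raise ValueError("event count is not a whole number of bytes")
    out = bytearray()
    for i in range(0, len(events), 8):
        value = 0
        for track in events[i:i + 8]:
            value = (value << 1) | (1 if track == ONE_TRACK else 0)
        out.append(value)
    return bytes(out)


def capacity_bytes(bits=1_700_000):
    """Whole bytes that fit in the quoted 1.7 megabit cassette capacity."""
    return bits // 8
```

Because every bit produces a flux change on exactly one track, the pattern is self-clocking, which is consistent with the recorder recovering a TAPE CLOCK signal on playback. The quoted capacity works out to about 212,500 bytes before any gaps are subtracted.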

a.  Motor  Drive  Card 

The Motor Drive Card received a waveform called
"MOTOR CLOCK," which triggered a one-shot. The output of
the one-shot triggered a flip-flop connected to the motor
drive amplifier circuitry, such that the resulting output to
the motor was a series of steps applied to alternate windings.
The MOTOR CLOCK waveform was generated by a series of trigger
pulses from the microcomputer at programmable
intervals. The logic which would be required to control an
external oscillator during write operations, in a dedicated
hardware system, was replaced by program statements executed
by the microcomputer. Although the card had provision for
operating the motor in reverse, the provision was not used
in the MIDAS system.

b.  Write  Step  Card 

The  Write  Step  Card  received  both  the  SERIAL 
DATA  and  the  WRITE  CLOCK  waveforms,  and  generated  the  pulses 
which  were  used  to  alter  the  magnetic  signature  on  the  tape. 
The  WRITE/READ*  line  was  also  used  in  this  card  to  determine 
whether  the  machine  was  in  the  WRITE  or  READ  status.  When 
a WRITE  CLOCK  pulse  was  received,  the  data  presented  was 
used  to  write  a flux  change  on  the  appropriate  track. 

c.  Read  Amplifier  Card 

The Read Amplifier Card sensed the flux changes
on the tape during READ operations, and produced two signals
used by MIDAS. The TAPE CLOCK and TRACK 1 DATA signals reproduced
the waveforms presented to the WRITE CLOCK and SERIAL
DATA inputs on the Write Step Card.

d.  Read  Oscillator  Card 

The Read Oscillator Card, manufactured locally,
was used to produce a MOTOR CLOCK waveform when the recorder
was used in the READ mode. This card used a SIGNETICS NE-555
timer integrated circuit, which could be adjusted to run at
frequencies around 360 Hz, the recommended motor stepping
frequency during READ operations. The oscillator was gated
ON by driving the READ* line to logical "0", and the MOTOR
CLOCK line to logical "1". Since the recorder was never in
this condition in the WRITE mode, this logical state implied
that the system microcomputer was directing a READ operation.

The necessity for the separate oscillator card
is not at once obvious, since the microcomputer can generate
the required waveforms for both the MOTOR CLOCK and the WRITE
CLOCK during WRITE operations. It must be remembered, however,
that the signals on the tape may appear in a pseudo-random
manner, due to vibration of the recorder while writing, uneven
tape slack, etc. If the microcomputer were required to produce
the MOTOR CLOCK waveforms during READ operations as well,
it is possible that a TAPE CLOCK pulse might pass undetected
during the MOTOR CLOCK pulse generation routine.

3. Microcomputer

The microcomputer selection for MIDAS was determined
by the microprocessor used in the microcomputer. Although
there existed a plentiful number of discrete microprocessor
devices, only one company had, to date, supported its microprocessor
with a higher-level language. That company was
INTEL, Inc., which provided the PL/1-based language, PL/M,
written especially for two of its microprocessors, the 8008
(used in MIDAS), and a new version, the 8080. Since the
question of the microprocessor had been decided, it was then
necessary to search for a manufacturer who included the 8008
or the 8080 in a microcomputer. Although INTEL manufactured
prototyping systems for both of their devices, they were much
too bulky for inclusion in a working system. Additionally,
a rather large price differential existed between INTEL and
their only competition, PRO-LOG. Since the PRO-LOG machine
was inexpensive, lightweight and compact, it was selected.

The  MPS-805  microcomputer  used  five  card  modules: 
a Central  Processor  Unit  (CPU)  card,  a Read  Only  Memory  (ROM) 
card,  a Random  Access  Memory  (RAM)  card,  an  Input  card,  and 
an Output card.

B.  SECONDARY  COMPONENTS 

Included  in  the  MIDAS  package  were  a digital  clock,  with 
time  readout  available  to  the  microcomputer  input,  and  a 
keyboard  which  was  used  to  enter  execution  instructions, 
parameters  requested  by  programs  currently  executing,  and 
new  program  instructions. 

A locally  manufactured  card,  the  Digital  Input  Select 
Card  (DISC),  was  used  to  multiplex  the  microcomputer's 
input  ports.  Also,  four  HEWLETT-PACKARD  integrated  circuit 
hexadecimal  displays  were  installed  in  the  MIDAS  front  panel. 

1. Digital Clock

The digital clock used in MIDAS was an MM-5313 integrated
circuit. The clock required a 60 Hz input, and provided
a multiplexed binary output as well as multiplexed
outputs for driving seven-segment numeric displays. The 60
Hz signal was derived from the microcomputer quartz crystal
timebase which provided timing signals to the microcomputer.
A "hold" was provided for time setting, in conjunction with
a "fast" pushbutton and a "slow" pushbutton, for advancing
the displayed time to the proper time in fast or slow fashion.
The outputs from the clock were sent to a DISC card, and
could be read into the microcomputer on command. Six seven-segment
bar displays displayed hours, minutes, and seconds
on the front panel.

2. Keyboard

The keyboard, originally designed for a desk calculator,
was modified to allow it to output all the hexadecimal characters.
The modification consisted of changing the special
purpose keys for mathematical operations to the keys necessary
to output the characters not shared between the decimal and
the hexadecimal counting systems (i.e., A, B, C, D, E, F).
The multiplication key was left connected as a special function
key, and was used for "word entry", signifying that a byte
was ready for input to the microcomputer.

3. Digital Input Select Card (DISC)

The  DISC  was  used  for  multiplexing  up  to  four  inputs 
per  card  into  a common  input  port.  Using  "wired  and,"  or 
open  collector  output  NAND  gates  for  data  selection,  the  DISC 
received  an  address  (1-4),  and  output  data  presented  to  its 
corresponding  input  port. 

Any number of DISCs could be bussed into the same
input port, as long as the DISC which was active was not in
competition with another DISC on the same bus.

4. Hexadecimal Display


Four HEWLETT-PACKARD hexadecimal displays, mounted in
the MIDAS front panel, were used for verification of memory
content, operator prompting, etc. The displays were configured
as five columns of seven rows of light-emitting diodes.


III. SYSTEM OPERATION

A.  GENERAL 

As emphasized before, MIDAS operated as a software system.
The mesh between understanding the software and operating the
system is so tightly woven that even the barest understanding
of MIDAS operation depends on understanding the program structure.
With that premise in mind, software modules were written
and debugged. Since these modules existed in subroutine form
in ROM, they could be linked together and executed with a
minimum of user-written code.

B.  MICROCOMPUTER  MEMORY  ORGANIZATION 

The MPS-805 microcomputer's memory was arranged as a set
of pages, each with 256 lines. Each line was a byte of eight
bits, which could be represented as a pair of hexadecimal characters.
There were three pages of ROM and twelve pages of
RAM installed in MIDAS for general data collection. Two of
the general-purpose registers, H and L, were used for "pointing"
at a particular place in memory, either ROM or RAM. Register
H was used to point to the page desired, and register L pointed
to the line desired.
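
The page-and-line arrangement amounts to splitting an address into a high byte (held in H) and a low byte (held in L). A minimal sketch of that mapping follows; the function names are hypothetical, and no claim is made here about which page numbers held ROM and which held RAM in MIDAS.

```python
LINES_PER_PAGE = 256  # from the text: each page holds 256 one-byte lines


def to_address(page, line):
    """Combine a page number (register H) and a line number (register L) into one address."""
    if page < 0 or not 0 <= line < LINES_PER_PAGE:
        raise ValueError("page or line out of range")
    return page * LINES_PER_PAGE + line


def from_address(address):
    """Split an address back into (page, line), i.e. the values to place in H and L."""
    if address < 0:
        raise ValueError("address must be non-negative")
    return divmod(address, LINES_PER_PAGE)
```

For example, line F0 (hexadecimal) on page 0 is simply address 00F0, and address 0312 corresponds to page 3, line 12 (hexadecimal).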

C. SOFTWARE MODULES

In MIDAS, all of the software modules were written as subroutines.
It was therefore possible to link one or more of
these subroutines into a working data acquisition package. A
listing of the primary subroutines, together with a brief description
of each, is presented below.

1. EXEC

The  EXEC  routine  was  accessed  by  the  computer  at  any 
time  the  RESET  button  was  pressed,  or  whenever  power  was 
restored to the system. The machine code for a JUMP UNCONDITIONAL
instruction was loaded into the first line in RAM;
then,  using  subroutine  KEY,  the  keyboard  was  read.  The  user 
keyed  in  the  line  number,  and  then  the  page  number  of  the 
location  to  which  he  wished  to  jump.  The  JUMP  UNCONDITIONAL 
instruction  transferred  complete  control  to  the  program  located 
at  the  addressed  location. 

2. KEY

The  KEY  routine  read  the  keyboard  a character  at  a 
time.  When  a key  was  depressed,  after  KEY  had  been  called, 
the  value  indicated  on  the  key  was  transmitted  to  input  port  0. 

3. TIM 1

TIM 1 provided a delay in increments of 11.12 milliseconds,
used in the KEY routine for debouncing the keyboard
switches. The number of delays was loaded into register D,
and TIM 1 was called.


4. RCDR

RCDR  operated  the  cassette  recorder.  By  generating 
the  MOTOR  CLOCK  as  well  as  the  WRITE  CLOCK  waveforms,  the 
microcomputer  had  complete  control  over  the  recorder.  Since 
it  was  desirable  to  minimize  code  whenever  possible,  RCDR  was 
used  to  generate  "gaps"  where  only  the  erase  head  was  in 
operation,  and  no  flux  reversals  were  written  onto  the  tape, 
as  well  as  when  normal  write  operations  were  desired. 

5.  LGAP 

LGAP  used  RCDR  to  write  a long  gap  on  the  tape.  LGAP 
was  used  to  erase  a markedly  long  section  of  tape,  or  to 
make  a long  leader  at  the  beginning  of  a new  cassette. 

6. SGAP

SGAP  wrote  a gap  equivalent  to  four  eight-bit  bytes, 
and  was  used  to  place  the  tape  in  motion  prior  to  writing  on 
the  tape. 

7. HDR

HDR used routines KEY and CLK to produce a "standard"
header of length 10 bytes, beginning at the first line of RAM.
Using routine CLK, the hours and minutes output from the
clock were loaded, and the routine then called KEY to load
the keyed-in values into the remaining header locations.

8. DEMO

DEMO was a demonstration of module linkage. This
routine called LGAP, HDR, and DGTR, and returned to HDR when the
first data acquisition/record sequence was complete.

9. DGTR

DGTR represented the core of the software. Using
information placed into RAM by HDR, DGTR selected analog
signals by sequencing through the DAS-16 and stored their
digitized equivalents in RAM until a variable (user-defined)
length buffer was filled. When the buffer was filled, DGTR
called RCDR to record the buffer. If there were remaining
data points to be taken, DGTR continued until the number of
pages of data points specified by the user had been satisfied.
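
The fill-then-record loop described for DGTR can be outlined as follows. This is a sketch under stated assumptions, not the MIDAS code: the callables read_channel and record stand in for the DAS-16 and RCDR, the buffer is taken to be one page of 256 data points, and the channel scan is assumed to restart at channel 1 for each page.

```python
def acquire(read_channel, record, highest_channel, pages, points_per_page=256):
    """Fill a buffer by scanning channels in order, record it, and repeat for each page.

    read_channel(ch) -> sample and record(buffer) are hypothetical stand-ins
    for the DAS-16 conversion and the RCDR tape routine.
    """
    if highest_channel < 1 or pages < 1:
        raise ValueError("need at least one channel and one page")
    for _ in range(pages):
        buffer = []
        channel = 1  # assumed: each page starts its scan at channel 1
        while len(buffer) < points_per_page:
            buffer.append(read_channel(channel))
            channel = channel % highest_channel + 1  # wrap after the highest channel
        record(buffer)  # buffer is full: hand it to the recorder
```

A quick way to exercise the sketch is to pass a list's append method as record and a simple function as read_channel, then inspect the recorded pages.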

10. CLBTR

CLBTR was used to calibrate the DAS-16, to examine
a relatively steady-state analog signal, or to set the zero
and full-scale points on an input. A "99" appeared in display
0 when CLBTR was called, and the user entered the hexadecimal
code for the channel he wished to examine.

11. READR

READR  was  used  to  read  from  the  tape  cassette,  and 
could  be  used  for  either  data  retrieval  or  program  entry 
into  the  machine.  The  user  keyed  in  a line  number  and  page 
number  where  the  tape  information  was  to  be  stored.  The 
user  then  keyed  in  the  total  number  of  bytes  to  be  read  in 
(up  to  256),  in  hexadecimal;  READR  then  read  that  many  bytes 
into  the  memory. 

12. AUDR


AUDR  was  a dual-purpose  routine,  used  for  both  loading 
RAM  and  examining  ROM  or  RAM.  After  accessing  the  routine, 
the  user  keyed  a "1"  for  loading,  or  a "0"  for  examining, 
followed  by  the  line  and  page  number  of  the  code  he  wished  to 
load or audit. Port 0 display showed the line number to be
operated on, in load, or the line number of the data presented
in port 1. In examine mode, the KEY routine was used
only  as  an  event  monitor.  In  the  load  mode,  the  KEY  routine 
actually  loaded  RAM. 

D.  OPERATING  INSTRUCTIONS 

After  applying  power,  the  user  observes  both  display 
ports  showing  "00".  This  condition  exists  at  any  time  the 
machine  is  reset  or  powered  up.  The  machine  is  executing 
EXEC,  waiting  for  a two-part  address  (line  and  page  number) 
to  which  to  jump.  The  user  then  keys  in  the  line  number, 
followed  by  "word  entry,"  and  the  page  number,  followed  by 
"word  entry."  If,  for  example,  it  is  desired  to  use  DEMO, 
the user will key "F0," "word entry," "word entry" (the
"word entry" key by itself is an implied zero). If a tape
is loaded, and the head is down, the tape will begin to move,
and port 0 will count down from "07" to "00." Almost immediately,
a "01" will appear in port 0, prompting the month.
The user then keys in the number corresponding to the month,
followed by "word entry." The display will then prompt "02,"
calling for the day, which should be keyed in, and then the
display will prompt "03," which should be answered with the
year. The day, month, and year will be stored as pairs of
binary-coded-decimal numbers, rather than as hexadecimal
numbers. This is the only information keyed in as decimal
numbers. The "04" prompt calls for the total number of pages
(256 data points each) to be taken. With 11 pages of RAM installed,
the user's answer may be in the range of 1 to B (remember,
hexadecimal input). When prompting "05", the machine is
asking for the highest channel number to be included in the
scan. The range of answers is 1 to 10 (hexadecimal). The
display will prompt "06" for the inter-cycle delay desired;
this may be in the range of 0 to FF, and will produce delays
of 0 to 2.55 seconds, respectively. The prompts "07" and
"08" may be ignored, or used for extra correlation information
as desired. When the "word entry" key is released after
the "08" prompt, HDR will record the information read from
the clock and keyed in, and the DEMO routine will then call
DGTR to acquire the data requested.

When  DGTR  has  finished  recording  the  data  on  tape,  it  will 
return  to  DEMO,  which  then  calls  HDR  again.  This  sequence 
will  be  repeated  indefinitely  until  the  machine  is  reset. 
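
The header entries described in these operating instructions can be checked with a short sketch. The function names are hypothetical, and the 10-millisecond step per delay count is an inference from the stated range (0 to FF giving 0 to 2.55 seconds), not a figure given in the text.

```python
def to_bcd(value):
    """Pack a two-digit decimal number (0..99) into one binary-coded-decimal byte."""
    if not 0 <= value <= 99:
        raise ValueError("BCD byte holds 0..99")
    return ((value // 10) << 4) | (value % 10)


def inter_cycle_delay_seconds(count):
    """Delay for prompt "06": 0..FF maps to 0..2.55 s (10 ms per count, inferred)."""
    if not 0 <= count <= 0xFF:
        raise ValueError("delay count must fit in one byte")
    return count / 100.0


def header_in_range(pages, highest_channel, delay_count):
    """Check the keyed-in values against the ranges quoted for prompts 04, 05 and 06."""
    return (1 <= pages <= 0xB
            and 1 <= highest_channel <= 0x10
            and 0 <= delay_count <= 0xFF)
```

Storing the date as BCD means the byte for day 12 reads "12" on a hexadecimal display, which is presumably why the date is the one item keyed in as decimal.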

V. SUMMARY AND CONCLUSIONS

MIDAS combines a high-speed multiplexed data acquisition
system module, an incremental digital tape cassette recorder,
and a second-generation microcomputer in a compact, lightweight
package. Requiring only external electrical power,
the system is capable of completely unattended operation,
and may be programmed to operate in literally an infinite
number of modes.

The  results  obtained  using  MIDAS  to  date  have  been 
encouraging, resulting in the production of three other hand-built
systems. The prototype for a dedicated airborne system
is  in  its  initial  stage.  Used  to  monitor  the  fatigue-critical 
points  in  an  airframe  structure,  the  device  will  be  extremely 
compact  and,  if  put  into  production,  may  cost  less  than  two 
thousand  dollars,  even  using  militarized  parts. 

Abstract

This paper describes a flexible disk controller design based on a
microprocessor. The controller provides a very simple time-independent
interface to the host computer. The disk with controller can be readily
connected to many different hosts because of the simplicity of the interface
and the portable software resident in the controller microprocessor.

Introduction 

In  this  paper  we  describe  a floppy  disk  controller  designed  for  a 
computer  science  laboratory  containing  several  kinds  of  computers. 

One objective of the controller design was to choose a single
host/controller interface suitable for all computers. In the computer science
laboratory, student projects include the design and implementation of
operating systems and file-management systems. Therefore, another
objective of the disk controller was to present a logical interface
to the host computer that did not preclude experimentation with file and
directory structures. Since the host computer software would in many cases
be in the development stage, it was desirable that the disk controller be
able to detect as many errors as possible in the usage of the interface
by the host computer. A machine-independent floppy disk controller
having an RS-232C (1) standard interface (as shown in Figure 1) was implemented.
The portability advantage of the RS-232C interface approach over a
direct-to-bus parallel interface is well known and was felt to be more important
than the loss of speed suffered by choosing a serial over a parallel interface.
More important to the portability, the disk controller is itself a microcomputer
containing its own disk handler software. The controller provides significant
functional capability while handling all the troublesome unique properties
of the disk, such as interrupts and fixed-size sectors.

In  addition  to  host-microcomputer  independence,  a number  of  secondary 
advantages  are  realized  from  the  intelligent-controller  approach  to  the  disk 
interface.

1. The host microcomputer software requirements are reduced. A
significant portion of the operating system necessary to support
the disk has been moved into the controller microcomputer, thus
freeing scarce memory and execution-time overhead in the host
microcomputer. The controller microprocessor provides multiple
open file capability, a full-sector buffer for each of up to
four open files, very efficient sequential file access both
forward and backward, random access to the character level,
update-in-place, reading and writing the same file, diskette
initialization, efficient logical sector spacing around a
track, and disk error recovery by retrying operations.

2. User program crashes are made less serious by the inaccessibility
of  the  disk  handler  code.  Neither  the  code  nor  the  data  in  the 
controller  microcomputer  can  be  directly  accessed  and  destroyed 
by  the  user  program. 

3. Critical time dependency is removed from the host microcomputer.
If the disk were interfaced to the host microcomputer in the usual
direct-to-bus fashion, the host might be saturated during disk
transfers by frequent interrupts. Interrupts during disk transfers
require immediate attention since characters must be transferred
at a speed fixed by the rotational speed of the disk.

4. A simple and verifiable interface protocol between the host and
the controller is implemented, in contrast to the very complex
hardware-oriented interface of the host software with a direct-to-bus
interface. The intelligent controller enforces the interface
protocol between the host and the controller, giving specific
diagnostics for violations.

Host/Controller  Communication 

Commands and data are passed from the host microcomputer to the
controller microcomputer. The controller returns data and status to the
host. Data, commands, and status are all transmitted as serial asynchronous
8-bit quantities across the RS-232C interface. Data is distinguished from
commands and status using the most significant bit of each 8-bit quantity, as
shown in Figure 2. The advantage gained is that significantly fewer bytes need
be transferred between the host microcomputer and the controller for files
consisting of ASCII characters. Alternative means of distinguishing data
from commands and status involve transmitting counts of the number of data
bytes transmitted or always transmitting special end-of-data characters
and always transmitting status at the end of every operation. For example,
under such a scheme, after a read-character operation that completed normally,
both a status byte, to indicate normal termination of the operation, and a data
byte would need to be returned. However, if the data byte is self-identifying,
status would need to be returned only when an operation terminated abnormally.

FIGURE 2. Command, status, and data words.

This  scheme  for  distinguishing  data  from  commands  or  status  has  the 
disadvantage  that  binary  files  cannot  be  transmitted.  It  was  assumed  that 
all  files  would  be  stored  as  ASCII  characters.  Although  this  is  wasteful  of 
space  for  object  files,  it  has  the  advantage  that  all  files  can  be  printed 
and  edited. 
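
A short sketch shows how a receiver might separate self-identifying bytes. Since Figure 2 is not reproduced here, the polarity is an assumption: a clear most significant bit is taken to mean a 7-bit ASCII data byte, and a set bit to mean a command or status byte. The function names are hypothetical.

```python
FLAG_BIT = 0x80  # most significant bit of each 8-bit quantity


def is_data(byte):
    """Assumed convention: MSB clear means data, MSB set means command or status."""
    return (byte & FLAG_BIT) == 0


def split_stream(stream):
    """Separate a received byte stream into data bytes and command/status bytes."""
    data = bytearray()
    control = []
    for byte in stream:
        if is_data(byte):
            data.append(byte)
        else:
            control.append(byte)
    return bytes(data), control
```

Under this convention any byte of a binary file with its top bit set would be mistaken for a command or status byte, which is the limitation described above.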

The  software  in  the  disk  controller  is  a 1.25K  byte  program  that 
performs  the  commands  sent  to  it  by  the  host  computer,  buffers  data  in 
sector-size  blocks,  transfers  data  to  and  from  the  physical  disk,  presents  a 
simulated interface of four random or sequentially accessed virtual units to
the host computer, and performs several other functions that can be inferred
from the description of the controller commands and status words that follows.

Fundamental  to  the  communication  between  the  host  and  the  controller  is 
the  concept  of  a unit.  The  controller  recognizes  virtual  units  0 through  3. 
Each  virtual  unit  conceptually  has  a read/write  head  that  is  located  at  a 
"current"  character.  Physically  the  current  character  is  located  by  four 
parameters associated with the virtual unit:

• the drive number, which identifies which of up to four
disk drives the current character is on,

• the track number, which identifies which of 77 tracks on
the disk the current character is in,

• the sector number, which identifies which of 26 sectors on
the track the current character is in, and

• the character number, which identifies which of the 128
characters within the sector is the current character.

Each disk is viewed as an array of characters formed by ordering the
characters within each sector from 0 to 127, the sectors within a track
from 0 to 25, and the tracks on a disk from 0 to 76. The array of characters
on the disk is formed by laying the sectors within each track end-to-end and
the tracks on the disk end-to-end, as shown in Figure 3. Viewing the disk as
an array of characters simplifies sequential access of characters on a unit.
Sequential access can be either forward or backward and is requested via the
read/write-last-character and the read/write-next-character commands.
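
The mapping between the three on-disk location parameters and a position in
the character array can be sketched as follows. The function names are
hypothetical; only the sizes (77 tracks, 26 sectors, 128 characters) come
from the text.

```python
TRACKS, SECTORS, CHARS = 77, 26, 128  # sizes given in the text


def to_index(track, sector, char):
    """Position of a character in the per-disk character array."""
    return (track * SECTORS + sector) * CHARS + char


def from_index(index):
    """Inverse of to_index: recover (track, sector, char)."""
    if not 0 <= index < TRACKS * SECTORS * CHARS:
        raise IndexError("outside the disk")
    track, rest = divmod(index, SECTORS * CHARS)
    sector, char = divmod(rest, CHARS)
    return track, sector, char
```

With these sizes one disk holds 77 x 26 x 128 = 256,256 characters, so the
last character on a disk has array position 256,255.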

A read or write command to a virtual unit can be for the last character,
the current character, or the next character. When the current character is
read or written, the four parameters locating the current character do not
change. However, when the last character or the next character is read or
written, the track number, the sector number, and the character number are
updated prior to the read or write operation so that the read/write head of
the virtual unit is moved backward or forward, respectively, in the character
array on the disk, as defined above.
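
The update of the location parameters that precedes a last-character or
next-character operation might look like the sketch below. It is an
illustration under stated assumptions, not the controller's 8080 code: the
function name step is hypothetical, and returning a status string without
changing the parameters when the move would leave the disk is an assumed
behavior (the underflow and overflow conditions themselves appear in
Table 2).

```python
TRACKS, SECTORS, CHARS = 77, 26, 128  # sizes given in the text


def step(track, sector, char, direction):
    """Move the virtual head one character forward (+1) or backward (-1).

    Returns the new (track, sector, char), or a status string when the move
    would leave the disk. The drive number is never changed by a step.
    """
    char += direction
    if char == CHARS:              # ran off the end of the sector
        char, sector = 0, sector + 1
    elif char < 0:                 # ran off the start of the sector
        char, sector = CHARS - 1, sector - 1
    if sector == SECTORS:          # ran off the end of the track
        sector, track = 0, track + 1
    elif sector < 0:               # ran off the start of the track
        sector, track = SECTORS - 1, track - 1
    if track == TRACKS:
        return "disk overflow"
    if track < 0:
        return "disk underflow"
    return track, sector, char
```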


[FIGURE 3. The disk as an array of characters: sectors 0 through 25 of each
track laid end-to-end, and the tracks laid end-to-end.]


Random access to the character level is facilitated by commands that
allow the drive number, the track number, the sector number, or the character
number for a unit to be set at any time. By setting the location parameters
prior to reading or writing the current character, data anywhere on any disk
can be accessed. Once a location has been set, access can continue in a
sequential manner, if desired.

Read and write commands on a unit can be mixed; in fact,
updates-in-place are readily accomplished. An update-in-place is a read
operation followed by a write operation that replaces some or all of the
data read. This feature is very useful and is not commonly found in disk
controller implementations. In this controller it is a natural result of the
random access implementation.

Besides commands to read and write data and change the location of a
virtual unit's read/write head, there are a number of other commands for
such tasks as closing units, initializing the controller, formatting a
disk, and requesting status. A complete list is shown in Table 1.

Each  virtual  unit  has  an  associated  status  word  that  is  returned  by 
the  read-unit-status  command.  The  status  word  contains  four  binary 
pieces  of  information:  not-ready,  write-protect,  w-flag,  and  r-flag. 

The  not-ready  flag  means  the  drive  associated  with  the  virtual  unit  does 
not  contain  a diskette  or  the  disk  door  is  not  closed.  The  write-protect 
flag  means  the  diskette  has  a write-protect  seal.  The  w-flag  indicates 
that  the  copy  of  the  data  in  the  buffer  differs  from  the  data  on  the  disk; 
therefore,  the  sector  must  be  written  to  the  disk.  The  r-flag  means  that 
a copy  of  the  disk  sector  has  been  placed  in  the  buffer. 
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TABLE  1 


DISK CONTROLLER COMMANDS

u is the unit number in the command.
d is the data word which must follow the command, or the data word returned
  as a result of the command.
s is the status returned by the controller.

Command, parameters        Controller  Description
                           Returns

read drive, u              d           The drive number associated with unit
                                       u is returned as data to the host.

read track, u              d           The track number associated with unit
                                       u is returned as data to the host.

read sector, u             d           The sector number associated with unit
                                       u is returned as data to the host.

read byte, u               d           The character number associated with
                                       unit u is returned as data to the host.

set drive, u, d            s           The drive number associated with unit
                                       u is set to d if d is valid. s is "OK"
                                       or "invalid data." Any data in the
                                       buffer is written to the disk.

set track, u, d            s           The track number associated with unit
                                       u is set to d if d is valid. s is "OK"
                                       or "invalid data." Any data in the
                                       buffer is written to the disk.

set sector, u, d           s           The sector number associated with unit
                                       u is set to d if d is valid. s is "OK"
                                       or "invalid data." Any data in the
                                       buffer is written to the disk.

set byte, u, d             s           The character number associated with
                                       unit u is set to d if d is valid. s
                                       is "OK" or "invalid data."

read previous              d or s      The virtual unit u's location
character, u                           parameters are moved backward one
                                       character and that character is
                                       returned as d. If for any reason the
                                       character cannot be read, s is
                                       returned to indicate the problem.

read current               d or s      The character defined by virtual unit
character, u                           u's location parameters is returned
                                       as d. If for any reason the character
                                       cannot be read, s is returned to
                                       indicate the problem.

read next                  d or s      The virtual unit u's location
character, u                           parameters are moved forward one
                                       character and that character is
                                       returned as d. If for any reason the
                                       character cannot be read, s is
                                       returned to indicate the problem.

write previous             s           The virtual unit u's location
character, u, d                        parameters are moved backward one
                                       character and d is written at the new
                                       location. s is returned as "OK" or,
                                       if d cannot be written, s identifies
                                       the problem.

write current              s           The character defined by virtual unit
character, u, d                        u's location parameters is replaced
                                       by d. s is returned as "OK" or, if d
                                       cannot be written, s identifies the
                                       problem.

write next                 s           The virtual unit u's location
character, u, d                        parameters are moved forward one
                                       character and d is written at the new
                                       location. s is returned as "OK" or,
                                       if d cannot be written, s indicates
                                       the problem.

close, u                   s           The virtual unit u is closed by
                                       writing any data in its buffer to the
                                       disk. s is returned as "OK" or, if
                                       the buffer cannot be written, s
                                       indicates the problem.

reset controller                       All units are initialized to disk 0,
                                       track 0, sector 0, and byte 0. The
                                       buffer associated with each unit is
                                       initialized to empty. All disk drive
                                       heads are returned to track 0.

restore disk                           The disk drive head of virtual unit u
                                       is returned to track 0.

After each command from the host microcomputer to the disk controller,
either data is returned, indicating "OK" status, or a status byte is
returned which indicates either the normal termination of the command or
the reason why the command could not be completed. Table 2 describes all
status returns.


Conclusion 

The floppy disk controller described here has been implemented using
the Intel 8080 microprocessor as the base of the controller microcomputer
and the PerSci Flexible Disk Drive. This floppy disk subsystem has been
connected to a SWTP-6800 microcomputer system for use as the primary mass
storage for a personal computer operating system. The interface to the
disk controller described here has been found to be easy to use and to have
all the advantages discussed earlier. In addition, the partitioning of
the usual operating system software which provides support for a disk into
two pieces, the disk support in the disk controller microcomputer and the
user interface in the host microcomputer, made the total system significantly
simpler to implement. This simplicity was gained by isolating the
troublesome disk-dependent considerations from the rest of the operating
system.

Probably the major innovation of this approach is the portability of
the device support software. The difficulties in creating portable software
are well known. However, the inexpensive microprocessor has provided us
with a perhaps-too-obvious solution to the portability problem. In order to
create a portable software package, simply dedicate a computer to the
software. What would at one time have been a ridiculous idea is now, because
of the low cost of microprocessors, the most economical, reasonable, and
straightforward solution.


TABLE 2.

STATUS RETURNS FROM DISK CONTROLLER

Status               Description

OK                   The previous operation was completed normally.

Drive not ready      The drive on which the previous operation was
                     attempted does not contain a diskette or the disk
                     door is not closed.

Write protected      The preceding operation was an attempt to write
                     on a diskette on which there was a write-protect
                     seal.

Invalid command      The previous command code is not a command recognized
                     by the disk controller or a data byte was received
                     when a command was expected.

Invalid data         A command byte was received when a data byte was
                     expected.

Seek error           The seek caused by the previous operation
                     resulted in the disk head stopping at the wrong
                     track.

Read/write error     The read or write caused by the previous operation
                     terminated due to a media error on the diskette.

Disk underflow       The previous operation caused the location
                     parameters to be decremented beyond the beginning
                     of track 0.

Disk overflow        The previous operation caused the location parameters
                     to be incremented beyond the end of track 76.


Abstract


The microprocessor provides a cost effective tool in that
a designer can make use of the software flexibility to
define and aid in hardware design. One example is the
design of an FFT/DFT processor. Furthermore, as the
dramatic decrease in both processor and memory costs and the
increase of microprocessor speed continue, the software
FFT/DFT package may become more attractive in some
applications. A one-dimensional DFT algorithm implemented
on an INTEL 8080 can provide, for 256 real inputs, one
complex output every 260 milliseconds with six-bit
precision. The entire system costs only six hundred
dollars.


Introduction


The basic discrete Fourier transform (DFT) is defined by:

$$A_r = \sum_{k=0}^{N-1} B_k \, e^{j 2\pi r k / N} \qquad (1)$$

where $j = (-1)^{1/2}$.

Ar is the r-th coefficient of the DFT and Bk denotes the
k-th sample of the time series, which consists of N samples.
The procedure for calculating Ar is: rotate the vector Bk by
the angle 2πrk/N and form the summation over k. This vector
rotation is generally performed by four real multiplications
and two additions. Golub [3] was able to accomplish the
same result by using only three real multiplications and
five additions. With minor modification to the
trigonometric table, Buneman [4] suggested a way to do it
with three real multiplications and three additions.
Despain [1,2] came out with two very interesting approaches.
One of them [2] is the famous CORDIC method, which takes both
real and imaginary inputs and employs only one
multiplication. The other one [1] takes only real inputs,
but uses a very simple technique and does not use
multiplication at all. At one end of the performance/cost
tradeoff, if high throughput and precision are not required,
the transform can even be done by an extremely inexpensive
hardware device, say a microprocessor. A quick-and-dirty
one-dimensional DFT routine has been programmed and tested
on an INTELLEC-8 microcomputer at the Moore School of
Electrical Engineering, University of Pennsylvania. The
purpose of this paper is to examine the cost, performance,
and precision of this package.


Implementation

Equation (1) can be rewritten as:

$$A_r = \sum_{k=0}^{N-1} B_k \cos(2\pi r k/N) + j \sum_{k=0}^{N-1} B_k \sin(2\pi r k/N) \qquad (2)$$

If all the inputs are real and normalized within (-1, +1),
Bk can then be represented by:

$$B_k = \sin(Z_k) \qquad (3)$$

where $-\pi/2 \le Z_k \le \pi/2$. Hence,

$$Z_k = \sin^{-1}(B_k) \qquad (4)$$

Substituting (3) into (2), we obtain:

$$A_r = \sum_{k=0}^{N-1} \sin(Z_k)\cos(2\pi r k/N) + j \sum_{k=0}^{N-1} \sin(Z_k)\sin(2\pi r k/N)$$

$$\phantom{A_r} = \frac{1}{2} \sum_{k=0}^{N-1} \left[ \sin(Z_k + 2\pi r k/N) + \sin(Z_k - 2\pi r k/N) \right] + \frac{j}{2} \sum_{k=0}^{N-1} \left[ \cos(Z_k - 2\pi r k/N) - \cos(Z_k + 2\pi r k/N) \right] \qquad (5)$$

The last derivation results from the identities
SIN(a+b) + SIN(a-b) = 2 SIN(a) COS(b) and
COS(a+b) - COS(a-b) = -2 SIN(a) SIN(b). It is not unusual to
restrict N (the number of input points) to be a power of two.
If so, 2πrk/N can be obtained by shifting. It is now obvious
that only additions/subtractions, shifts, and table look-ups
are required, and multiplication is totally eliminated.
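
As a numerical check of equation (5), the sketch below compares the
sum-of-sines form with a direct evaluation of the transform, following the
sign convention of equations (2) and (5). It is not the 8080 program: the
function names are hypothetical, floating-point library calls stand in for
the sine and cosine tables, and the angle is computed with ordinary
arithmetic rather than by shifting.

```python
import cmath
import math


def dft_direct(samples):
    """Reference: A_r = sum_k B_k * exp(j*2*pi*r*k/N), using multiplications."""
    n = len(samples)
    return [sum(b * cmath.exp(2j * math.pi * r * k / n)
                for k, b in enumerate(samples))
            for r in range(n)]


def dft_sum_only(samples):
    """Equation (5): per term, only sine/cosine values are added or subtracted."""
    n = len(samples)
    z = [math.asin(b) for b in samples]          # equation (4)
    out = []
    for r in range(n):
        re = im = 0.0
        for k in range(n):
            theta = 2 * math.pi * ((r * k) % n) / n
            re += math.sin(z[k] + theta) + math.sin(z[k] - theta)
            im += math.cos(z[k] - theta) - math.cos(z[k] + theta)
        out.append(complex(re / 2, im / 2))      # the factor 1/2 is a shift
    return out
```

For inputs normalized to the interval from -1 to +1, both functions agree to
within floating-point rounding; in the fixed-point package the agreement is
limited instead by the table resolution.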

The program to calculate the DFT using this equation
required 700 words (8 bits/word) and the associated tables
needed 2,300 words. The hardware to perform this DFT
consists of a processing unit based on the 8080
microprocessor, 3,000 words of read-only memory for the
program and the tables, and 600 words of read-write memory
for input buffers. In addition, we need the capability of
Direct Memory Access (DMA). A typical commercial product
fulfilling the above requirements is the INTEL SBC 80/20
single board computer. The whole system costs, including
the power supply and the chassis, around six hundred dollars.
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Performance 


After steady state is reached, while one set of data (say, MBX) is
processed, the other, previously calculated data (MBY) is
output to the I/O device. At the same time, another set of raw
input is put in MBY via the DMA function. We can simply
interchange the roles of MBX and MBY when one set of data
has been completely processed. In this way, the I/O time is
negligible compared to the DFT processing time. Since this
is a DFT algorithm, each output point is produced
successively. From equation (5), we know that the
processing time for each Ar does not depend on that for
other outputs. Performance is therefore greatly improved if
multiple processors are employed. For instance, N
microprocessors can be used to generate N outputs
simultaneously.

The throughput rate depends almost linearly on the number
of inputs (N). Due to our program structure, the program
responds a little slower when Zk + 2πrk/N of equation (5)
falls in the 4th quadrant and faster when it falls in the 1st
quadrant. By analysis and experiment, we have obtained the
following throughput rates:


N :    2      4      8     16     32     64    128    256

T :    2    3.9    7.6   15.2   31.1   63.0  127.7  259.2

B :  500  263.2  131.6   65.8   32.2   15.9    7.8    3.9

where N is the number of inputs,

T is the time for producing one successive output
(milliseconds), and

B is the processor bandwidth (hertz).

Obviously, the maximum sampling rate is very low. For 256
real input points, our package generates one output point
every 259.2 milliseconds, or produces 256 output points in
approximately 64 seconds. Thus our package constrains us
to very low frequency applications.

Precision


The input and output are quantized in 256 levels using an
eight-bit 2's complement representation. Because the
arguments (Zk and 2πrk/N) of the SIN/COS functions in eq.
(5) may vary from -π/2 to 2π (6.2832), we need 8 bits to
indicate the fractional part of the argument and a few bits to
represent the integer part. For simplicity, we use another 8
bits for the integer part. The partial sums of SIN and COS in
equation (5) are also kept in two 16-bit storages in order to
allow a maximum number (256) of input points without
overflow on accumulating the partial sum. Internal
operations such as additions/subtractions in eq. (5) are
therefore handled in double precision. The result, Ar, is
normalized (i.e., divided by N) before it is output.

We have run test samples on both a large computer
(Spectra 70/46), under the HARM Fortran subroutine yielding
12-bit precision outputs, and our quick-and-dirty package on
the INTELLEC-8. The HARM Fortran subroutine, which performs
Discrete Fourier Transforms on a complex three dimensional
array, is in the IBM Scientific Subroutine Package (SSP).
Inputs to these two programs ranged from -128 to 127, a
representation of -1 to +1 in 256 levels. The significant
8 bits of both outputs were compared. They agreed
to the first six bits.

Conclusion


For real-time applications, we can buy a hardwired FFT
processor which can transform 1024 complex inputs within 5
milliseconds. Such a processor costs around twenty to
thirty thousand dollars. In some off-line applications,
such as CRT synthesis or enhancing photographs and TV
images, we sometimes do not need high throughput and
accuracy. For such applications, this low cost DFT package
may become attractive. The maximum input sampling rate for
continuous processing is around 4 Hz for 256 points at a
time. At the present pace of progress in microprocessors,
the speed of the DFT will quickly increase. Furthermore,
if, instead of the INTEL 8080 (MOS) chip, we use bipolar
(TTL) or even ECL processors, the speed will be 20 to
100 times higher. In order to get even higher performance,
pipelined or parallel-processing microprocessor
architectures may be used.
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Abstract 

Toward computer-based research and education in the laboratory, we
have developed LABOLINK, an optically linked, laboratory-oriented
computer network. It consists of a PDP-11/40 minicomputer in our
laboratory linked with a medium scale computer, the HITAC 8350 of the
Department of Information Science, and a large scale computer, the
FACOM 230-75 of the Data Processing Center, Kyoto University. We have
developed a microprocessor-based terminal switching system in order to
use any computer in the LABOLINK from any terminal located within our
laboratory. Tedious modification of the operating system of the
PDP-11/40 was avoided through the introduction of a microprocessor.
The main objectives of the terminal switching system are as follows.

(1) Any remote terminal can be used as the console terminal of the
PDP-11/40.

(2) Any computer in LABOLINK can be accessed from any terminal.

(3) Terminal-specification differences are compensated.

(4) Terminal users can communicate with each other in a
conversational mode.

(5) Several terminals can be controlled simultaneously by a
high-level language program (FORTRAN or BASIC) of the PDP-11/40
without modifying input/output statements.

(6) New terminals can be installed without adding device drivers.

(7) If more than one computer is connected to the terminal switching
system, communication between computers can be realized.

By using Multi-User BASIC and the foreground-background capability
of the PDP-11's RT-11 monitor, simultaneous use of terminals is
possible. For example, two users can access the HITAC 8350 and the
FACOM 230-75 independently as foreground Multi-User BASIC jobs while
another user edits programs as a background job, since the network
control programs of the PDP-11/40 are written in Multi-User BASIC
(for this purpose, the original PDP Multi-User BASIC is extended).


The  microprocessor-based  terminal 
switching  system  discussed  in  this  paper 
can  be  used  in  other  computer  systems 
with  little  modification. 

1.  INTRODUCTION 

We  have  developed  an  optically  linked 
laboratory  oriented  computer  network 
LABOLINK  to  be  used  in  computer-based 
research  and  education  in  an  environment 
of  a university  laboratory. 

The LABOLINK consists of a minicomputer (PDP-11/40) in our
laboratory linked with a medium scale computer (HITAC 8350) of the
Department of Information Science and a large scale computer
(FACOM 230-75) (to be replaced by a FACOM M190, which is equivalent
to the Amdahl computer) of the Data Processing Center, Kyoto
University.

It is desirable that any computer in LABOLINK can be used
simultaneously from any terminal distributed within our laboratory.
Usually this is achieved by modifying the operating system, which is
known to be tedious work. We have developed a microprocessor-based
system for this purpose, since with this approach we only need to
deal with a microprocessor system instead of a minicomputer, and the
system can also be used, with little modification, in other computer
systems with different operating systems. Multi-User BASIC of the
PDP-11/40 is also modified to handle interrupt signals and device
status registers; it is used to write network control programs as
well as application programs such as a manuscript preparation system.
Using the microprocessor-based terminal switching system, Extended
Multi-User BASIC, and the RT-11 monitor with foreground-background
capability, connections between computers and terminals can be
realized simultaneously. For example, two users can use the
HITAC 8350 and the FACOM 230-75 independently as foreground
Multi-User BASIC jobs while another user edits his paper with the
manuscript preparation system as a background job.


The main objectives of the microprocessor-based terminal switching
system are summarized as follows.

(1) Any terminal can be used as the console terminal for the
PDP-11/40. That is, from any terminal, monitors (DOS or RT-11),
language processors (FORTRAN or BASIC) and utility programs can be
loaded.

(2) It is possible to control the microprocessor by programs written
in a high-level language such as Extended Multi-User BASIC in order
to simplify program development.

(3) Any terminal can be used as a terminal device of any program run
on the PDP-11/40. Thus any computer of LABOLINK can be accessed from
any terminal. Simultaneous usage of terminals is possible since the
network control programs are written in Extended Multi-User BASIC.

(4) Terminal-specification differences are compensated. Examples are
as follows. Some terminals only accept capital letters while others
accept both upper and lower case letters. Some terminals perform a
composite operation of carriage return (CR) and line feed (LF) when
they receive a single control code LF or CR.

(5) Terminal users can communicate with each other in a
conversational mode. This function is useful when message
transmission is necessary between terminals, for example, "when do
you finish your work?", "how can I use your program?" etc.

(6) Simultaneous control of several terminals by a high-level
language program is possible. A game program which uses several
terminals is easily implemented. This function is especially
important in cases such as the manuscript preparation system, which
requires two typewriters: one high resolution printer (Diablo
HyType I) for final manuscript printing and the other for control
and error messages.

(7) New terminals can be installed without adding device drivers,
and they are also controlled by previously developed software.

Further  functions  to  be  added  to  the 
system  are  as  follows. 

(8) Communication between computers can be realized when more than
one computer is connected to the terminal switching system.
Communication between two independent programs running
simultaneously on the PDP-11/40 is also possible.

(9) Both foreground and background monitor messages are printed on
the console terminal, and are distinguished by the heading "F" or
"B". These two kinds of messages should be distributed to two
different terminals.

(10) Further compensation of terminal specification differences is
necessary. For example, code conversion between ASCII and EBCDIC is
necessary when we have to use EBCDIC terminals.

In order to achieve these objectives, the terminal switching system
has four modes for each terminal and computer interface.

2.  OUTLINE  OF  THE  LABOLINK 

Although the LABOLINK is designed as a typical laboratory oriented
computer network, it has several characteristic features in both its
software and hardware. The configuration of LABOLINK is shown in
Fig. 1. A minicomputer, the PDP-11/40 located at our laboratory, is
linked with a medium scale computer, the HITAC 8350 of the Department
of Information Science, and a large scale computer, the FACOM 230-75
of the Data Processing Center, Kyoto University.

An optical fiber cable with a line speed of 1 Mbps links the PDP
with the HITAC. We have linked the PDP with the FACOM, which is to
be a host of the Japanese inter-university computer network partly
under development, by a 1200 bps TSS line. An optical fiber cable
was installed to establish broad band communication lines among
research rooms. The optical cable itself can transmit signals at
100 Mbps, but is currently operated at 4 Mbps due to the restriction
of the photo-electric converters. A single line multi-connection
(SLMC for short) has been developed to use a broad band optical
fiber line as many communication lines. The current version of SLMC
can handle at most 16 communication channels. It automatically
determines the number of active channels and varies its speed
accordingly. That is, it can be used as one very high speed line or
as several lower speed lines, according to the number of active
channels. SLMC employs a highly efficient self-synchronizing
transmission code, and its maximum possible transmission rate is
4 Mbps.

As a suitable model of computer communication interfaces, a new
class of automata called two-I/O-pair automata is introduced. Such
an automaton has two independent input-output pairs. As in the case
of actual interfaces, a machine connected to one input-output pair
is unable to know directly the input-output sequences realized at
the other input-output pair, although the output of the former pair
may contain some information about the latter. This property causes
specific problems which need not be considered in conventional
automata. Using this model, an automata-theoretic approach to
designing computer interfaces is realized (for example, reduction of
communication symbols, reduction of internal states, and increase of
reliability can be achieved). We have designed the computer
interfaces between the PDP and the HITAC by this approach; they have
higher error detecting capability.

Communication software of the HITAC is written in assembly language,
while communication software of the PDP is written in Extended
Multi-User BASIC, which is a modification of RT-11 Multi-User BASIC
made by adding commands to handle device status registers and device
interrupt signals. Since RT-11 Multi-User BASIC can be used by up to
6 terminals, simultaneous use of the FACOM and the HITAC is possible,
as well as of other BASIC programs. File transfer between the PDP
and the HITAC can be done by a simple command string according to
the RT-11 monitor standard format. As the HITAC and the PDP use
EBCDIC and ASCII codes, respectively, code conversion is necessary.
In order to reduce the size of the network control program, we have
developed a hardware code converter. Two memory addresses are
assigned to EBCDIC-to-ASCII and ASCII-to-EBCDIC converters, both of
which are realized by LSI code converters. To make a code
conversion, we just need to store a word and then read a word at the
same address. Code conversion is completed before the read
operation.
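
The store-then-read protocol of the converter addresses can be imitated in
software as sketched below. This is only an illustration: the class and
function names are hypothetical, Python's "cp037" codec is assumed as a
stand-in for the EBCDIC variant used by the HITAC, and only the low-order
eight bits of the stored word are treated as the character code.

```python
class CodeConverterPort:
    """Software stand-in for one memory-mapped converter address.

    A store latches the converted code; the following read returns it.
    """

    def __init__(self, source, target):
        self._source, self._target = source, target
        self._latch = 0

    def store(self, word):
        # Conversion happens at store time, so it is complete before any read.
        char = bytes([word & 0xFF]).decode(self._source, "replace")
        self._latch = char.encode(self._target, "replace")[0]

    def read(self):
        return self._latch


# Assumption: "cp037" approximates the EBCDIC code table of the real hardware.
EBCDIC_TO_ASCII = CodeConverterPort("cp037", "ascii")
ASCII_TO_EBCDIC = CodeConverterPort("ascii", "cp037")


def convert(port, word):
    """Store a word, then read the same address to obtain the converted code."""
    port.store(word)
    return port.read()
```

The network control program thus needs two memory references per character
and no conversion table of its own, which is the program-size saving the
hardware converter is meant to provide.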

A microprocessor-based terminal switching system is introduced to
access computers in the LABOLINK from terminals distributed in our
laboratory. We currently have four terminals: a DEC Writer, a
Teletype ASR-33, a Casio InkJet typewriter and a Diablo HyType I.

3. CONFIGURATION OF MICROPROCESSOR-BASED
TERMINAL SWITCHING SYSTEM

Fig. 2 shows the configuration of the microprocessor-based terminal
switching system. Two computer interfaces for the PDP-11/40 and four
terminal interfaces are currently connected to the system. All these
interfaces will be called microprocessor interfaces. One computer
interface is a console interface supplied by DEC and the other is a
multi-terminal interface. The latter is designed to work as three
independent interfaces when connected to the microprocessor-based
terminal switching system, while it alone corresponds to one
interface. The four terminals are a DEC Writer (30 characters/sec,
capital letters only), a Teletype ASR-33 (10 characters/sec, capital
letters only, PTR, PTP), a Casio InkJet typewriter (30
characters/sec, upper and lower case letters, low noise, PTR, PTP)
and a Diablo HyType I (30 characters/sec, upper and lower case
letters, plotter capability). Character display terminals will be
connected to the system in the future. The microprocessor is
TOSHIBA's TLCS-12, which is a 12-bit processor. The functions of the
terminal switching system are realized by a program stored in
2k-word ROM (read only memory) and 1k-word RAM (random access
memory). In order to realize the objectives of the system described
in Section 1, each terminal has the following four modes.

(1) Normal mode : Signals received from one terminal or computer
interface are transmitted to another terminal or computer interface
according to the connection registers of the microprocessor. Each
terminal or computer interface is in the normal mode when its power
is turned on. Before it is turned off, each terminal user must type
%E in order to erase all connections which were set up by this
microprocessor interface.


(2) Conversation mode : Terminal users can communicate with each
other by operating the sender's terminal in the conversation mode.
Usually the receiver's terminal is in the normal mode. If the
receiver is printing important results and does not want to be
interrupted (for example, during program list printing), the user
of the terminal can mask the conversation request. In this case
only the BELL signal of ASCII code is transmitted to the receiver.
Since some operations cannot be performed from terminals which are
not connected to the console interface, there is more demand for
the console interface. Thus a terminal connected to the console
interface cannot mask conversation requests. Unlike in the normal
mode, input characters are echoed back during the conversation
mode. The command %C changes the terminal from the normal mode to
the conversation mode and %N resets it from the conversation mode
to the normal mode.
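
The masking rule of the conversation mode can be summarized in a
short sketch. The class and function names are hypothetical and the
code models only what the paragraph above states: a masked receiver
gets the BELL character alone, and a terminal connected to the
console interface is not allowed to mask.

```python
BELL = "\x07"  # ASCII BEL character


class Terminal:
    """Toy model of a receiving terminal (names are illustrative)."""

    def __init__(self, name, on_console=False):
        self.name = name
        self.on_console = on_console  # connected to the console interface
        self.masked = False
        self.received = []

    def set_mask(self, masked):
        # A terminal on the console interface cannot mask requests.
        if self.on_console and masked:
            return False
        self.masked = masked
        return True


def send_conversation(receiver, message):
    """Deliver a conversation message, or only BELL if it is masked."""
    if receiver.masked:
        receiver.received.append(BELL)
    else:
        receiver.received.append(message)
```

Keeping the console terminal always reachable reflects the higher
demand for the console interface noted in the text.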

(3) Connection setting and display mode : Connections between two
microprocessor interfaces can be set up by any terminal or computer
interface. So before the terminal power is off (or before the end
of the program which set up the connection) %E must be sent to the
processor in order to erase connections determined by the terminal
or the computer interface. In order to control several terminals
simultaneously by a high-level language program, two kinds of
connections called real connections and virtual connections are
used. If a terminal or a computer interface A has a real connection
to another terminal or computer interface B, signals given from A
are transmitted to B. If the connection is virtual no such signal
transmission occurs, but the other microprocessor interfaces cannot
make a connection (real or virtual) to B. When we want to control
several terminals simultaneously by a high-level language program,
first we set up virtual connections from the computer interface
used by the program to the terminals to be controlled. Every time
before controlling one of these terminals, the mode of the computer
interface must become the connection setting and display mode and
one of the virtual connections should be changed to a real
connection. Display command D is used to print the current status
of the connection. Two commands %A and %U change the microprocessor
interface mode from the normal mode to the connection setting and
display mode and vice versa, respectively.

Fig. 2 A microprocessor-based terminal switching system

(4) Microprocessor mode : Contents of the microprocessor memory can
be altered by inputs given from a microprocessor interface. This
mode can be used by computer interfaces to load programs to the
microprocessor. A terminal interface can be in this mode only when
no other terminal or computer system uses the terminal switching
system. A pair of commands changes the interface from the normal
mode to the microprocessor mode and resets it to the normal mode.

Commands related to the mode transition are summarized in Fig. 3.

In order to set up a real connection between one terminal and one
computer interface the following procedure can be used.

(a) Input %A in order to change the mode of the terminal to the
connection setting and display mode.

(b) Input a two-way connection command which will establish a
two-way connection between the terminal and one computer interface.
If it fails (i.e. the computer interface is occupied), an error
message is printed, otherwise go to (e).

(c) Input a display command in order to know which terminal is
using the computer interface.

(d) Change the mode of the terminal into the conversation mode in
order to communicate with the computer interface user. Wait until
the other user finishes and go to (a).

(e) Input the %U command to change the mode of the terminal to
normal. Start to use the terminal.

(f) After finishing the job input the %E command for the
termination.

Commands are sent from a computer to the microprocessor by a PRINT
or WRITE statement of high-level languages. Since each command must
be followed by a carriage return and line feed, each command should
be written in an independent I/O statement like

      PRINT 1000
 1000 FORMAT (3H %M,/)

For the control of terminals T1,...,Tn simultaneously by a
high-level language program the following procedure is used.

(a) Create virtual connections to the terminals T1,...,Tn from the
computer interface which is used by the program.

(b) When it is necessary to transmit signals to and from terminals,
change the corresponding virtual connections to real.

It is possible to print out identical messages to more than one
terminal by a single PRINT statement.
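
The difference between real and virtual connections can be sketched
as a small table of reservations. This is a minimal model, not the
TLCS-12 program: the class name, the method names and the rule that
a target held by one interface is refused to every other interface
are assumptions drawn from the description of virtual connections
and from the "occupied" failure case in step (b).

```python
class Switch:
    """Toy connection table; interface names are plain integers."""

    def __init__(self):
        # target -> (owner, kind), where kind is "real" or "virtual"
        self.links = {}
        self.delivered = []  # log of (target, signal) pairs

    def connect(self, owner, target, kind):
        """Create or change a connection; refuse if the target is held."""
        holder = self.links.get(target)
        if holder is not None and holder[0] != owner:
            return False  # target is occupied by another interface
        self.links[target] = (owner, kind)
        return True

    def send(self, owner, signal):
        """Transmit a signal over every real connection of the owner."""
        for target, (link_owner, kind) in self.links.items():
            if link_owner == owner and kind == "real":
                self.delivered.append((target, signal))
```

A program would first reserve its terminals with virtual
connections, then switch one or more of them to real just before
printing. With several real connections active, a single send
reaches all of them, which corresponds to printing identical
messages on more than one terminal with one PRINT statement.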
4.  COMMANDS  FOR  EACH  MODE

The following commands are used in the connection setting and
display mode, where i and j are octal numbers (0 ≤ i, j < 31)
corresponding to the names of microprocessor interfaces. Let k be
the name of the microprocessor interface which gives the following
commands.

O i j : Create a one-way real connection from i to j.

O j   : Create a one-way real connection from k to j.

P i j : Erase a one-way real connection from i to j.

P j   : Erase a one-way real connection from k to j.

T i j : Create a two-way real connection between i and j.

T j   : Create a two-way real connection between k and j.

U i j : Erase a two-way real connection between i and j.

U j   : Erase a two-way real connection between k and j.

R j   : Create a virtual connection between k and j.

S j   : Erase a virtual connection between k and j.

N i   : Erase all connections (one-way, two-way, virtual) to and
        from i.

D i   : Display all numbers of microprocessor interfaces which
        have connections with i.

D     : Display all connections.
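
As an illustration of how compact this command set is, the
following sketch parses one command line into an action. It is not
the microprocessor program: the function name and the tuple layout
are hypothetical, the one-way command letter is assumed to be the
letter O, and the range of interface numbers is not checked.

```python
# (action, kind) for commands that take "i j" or just "j".
PAIR_COMMANDS = {
    "O": ("create", "one-way"),
    "P": ("erase", "one-way"),
    "T": ("create", "two-way"),
    "U": ("erase", "two-way"),
}
# (action, kind) for commands that always start from interface k.
VIRTUAL_COMMANDS = {
    "R": ("create", "virtual"),
    "S": ("erase", "virtual"),
}


def parse_command(line, k):
    """Return (action, kind, source, target) for a command from k.

    Interface numbers are read as octal, as in the command list.
    """
    parts = line.split()
    if not parts:
        raise ValueError("empty command")
    letter = parts[0].upper()
    args = [int(p, 8) for p in parts[1:]]  # ValueError if not octal

    if letter in PAIR_COMMANDS and len(args) in (1, 2):
        action, kind = PAIR_COMMANDS[letter]
        # With one argument the issuing interface k is the source.
        src, dst = (k, args[0]) if len(args) == 1 else args
        return (action, kind, src, dst)
    if letter in VIRTUAL_COMMANDS and len(args) == 1:
        action, kind = VIRTUAL_COMMANDS[letter]
        return (action, kind, k, args[0])
    if letter == "N" and len(args) == 1:
        return ("erase-all", None, args[0], None)
    if letter == "D" and len(args) <= 1:
        return ("display", None, args[0] if args else None, None)
    raise ValueError("unknown command: " + line)
```

For example, "T 17" given from interface 3 is read as a two-way
real connection between interface 3 and interface 17 octal.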

The following commands are used in the conversation mode.

C j : Conversational interrupt for interface j.

V   : Inhibition of any conversation request for k (it is
      effective when k is not connected to the PDP-11/40 console
      interface).

W   : Permission of conversation for k (it is used after V is
      valid).

In order to treat mistakes which occurred during command and
conversation message input, the following commands are prepared.

BS (Back Space) : One character correction.

1               : Erase all the characters in the current line.
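
The two correction commands amount to ordinary line editing, which
can be sketched as follows. The erase-line key is an assumption: the
control character used here (ASCII NAK) is only a stand-in, and the
function name is hypothetical.

```python
BS = "\x08"    # ASCII Back Space
KILL = "\x15"  # stand-in for the erase-line command (assumption)


def edit_line(keys):
    """Apply the correction commands to a sequence of typed keys."""
    buffer = []
    for ch in keys:
        if ch == BS:
            if buffer:
                buffer.pop()  # correct one character
        elif ch == KILL:
            buffer.clear()    # discard the whole current line
        else:
            buffer.append(ch)
    return "".join(buffer)
```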

Two kinds of internal registers are prepared for each
microprocessor interface to realize the functions of the terminal
switching system. These are called connection registers and command
registers. Bit assignments for connection registers are shown in
Fig. 4.

5.  APPLICATIONS

Typical applications of the system are as follows:

(1) Simultaneous use of computers in the LABOLINK from terminals
located in the laboratory: Since the RT-11 monitor has a foreground
and background capability, one background job and up to eight
foreground Multi-User BASIC jobs can be executed simultaneously.

For the simultaneous use of computers, communication programs are
written in Multi-User BASIC by adding statements for communication
control.

These statements are chiefly as follows: i) a statement for moving
the contents of a specified address, including I/O registers, to a
variable in BASIC or vice versa, ii) a statement for assembling a
data block transmitted from the FACOM and putting it into a BASIC
string variable, iii) two statements for message exchange with the
HITAC and a statement for setting up an interrupt vector address.
For example, WORD(A,B) assigns the content whose address is the
first argument to the second argument. TXTR(A$,C) receives a data
block from the FACOM and assigns it to a string variable A$, where
C=1 shows that no response is received during the predetermined
time period.
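
The behavior described for TXTR, a receive that reports a timeout
through a flag instead of failing, can be sketched in a modern
language as follows. The function below is hypothetical and not the
BASIC statement itself: a queue stands in for the link to the FACOM,
and the flag value 0 for a successful receive is an assumption,
since the text defines only the value 1.

```python
import queue


def txtr(link, timeout_s):
    """Receive one data block from the link.

    Returns (data, c). c is 1 when no response arrived within
    timeout_s seconds; c is 0 otherwise (0 is an assumed value).
    """
    try:
        return link.get(timeout=timeout_s), 0
    except queue.Empty:
        return "", 1
```

Returning the flag next to the data lets the calling program decide
whether to retry or to report the failure to the user.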


The following programs are prepared in computers in the LABOLINK.

PDP: Small programs for research aid. Game playing programs. A
manuscript preparation system.

HITAC: A PL/I cross assembler for the terminal switching
microprocessor. Programs for data base research. On-line
information retrieval system of papers in computer science.

FACOM: Programs for automata theory, logic design and combinatorial
theory which can be used from LABOLINK terminals.


A typical simultaneous usage of the system is as follows. One user
prepares a FORTRAN program as a background job of the PDP, while
one or two other users use the PDP, the FACOM or the HITAC under
foreground Multi-User BASIC programs.

(2) A manuscript preparation system:

As is the case of most application programs of the LABOLINK, which
are written in BASIC or FORTRAN, high-level language control of a
high resolution typewriter Diablo HyType I is possible. It is
achieved by developing control hardware which can be connected to
conventional typewriter interfaces of computers. Its
control-and-status register can be altered by ASCII control
characters or special combinations of printable characters.
Programs of the manuscript preparation system and a simple picture
drawing system are written in Multi-User BASIC. This paper is
prepared by the manuscript preparation system. An example of a
figure produced by a simple picture drawing system written in BASIC
is shown in Fig. 5.

Fig. 5 An example of a figure produced by a simple picture system
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editor  with  restricted  capability 
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The  system  uses  two  terminals,  one  for 
output  printing  and  the  other  for 
control  messages.  Usage  of  two  terminals 
is  realized  by  the  multi-terminal 
control  capability  of  the  terminal 
switching  system. 

6.  CONCLUDING  REMARKS 

A microprocessor-based terminal switching system seems to be a
reasonable method for realizing a minicomputer with many remote
terminals. However, for the sake of simplicity, control for
resource sharing is left to the users. This will not be a serious
problem as long as only one computer is connected to the system. We
are planning to expand the LABOLINK by introducing another computer
in our laboratory. Both the PDP and the new computer will be
connected to the terminal switching system, and thus further
extension of the terminal switching system will be required.
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